ABSTRACT


The  problem  of  transmitting  speech  over  communication  channels  with  smaller 
information-carrying capacity than that of conventional telephone links is discussed.

Bandwidth compression systems using articulatory constraints (vocoders) are
described and this is followed by a description of devices that analyse the speech
sound wave in terms of linguistic units - machines performing this task are called
automatic speech recognisers. Bandwidth economy can be achieved by recognising and
transmitting these linguistic units.

The difficulties of automatic recognition are discussed and its processes compared
with the human mechanism for speech recognition. It is suggested that, just as
in human speech recognition, the performance of an automatic recogniser could be improved
by using information about the statistics and the structure of the language
as well as the usual acoustic cues. The design and construction of a phoneme recogniser
for putting this idea to the test is described. The machine has three parts:
(1) the acoustic recogniser for detecting some simple phonemic cues, (2) stored knowledge
about the digram frequencies of these phonemes, and (3) a device for selecting
the phoneme that is most likely to occur in the light of both acoustic information
and of the relevant digram frequencies. The selection is indicated on a typewriter.

The recogniser has a repertory of 13 phonemes: /a:, i:, u:, t, k, s, ʃ, f,
z, m, n, l/, and deals with about 200 English words, spoken in isolation and containing
only these phonemes.

The performance was tested by comparing the phonemes at the input with the output,
and also by presenting this output to subjects in visual and in acoustic form.
It  was  found  that  the  score  for  correctly  recognised  words  increased  from  28  per  cent 
to  43  per  cent  when  information  about  digram  frequencies  was  added  to  the  acoustic 
cues.  The  implications  of  visual  and  acoustic  presentation  of  the  output  were 
examined  and  the  effect  of  increasing  the  contextual  information  at  the  disposal  of 
the  subjects  was  tested  experimentally.  Possible  future  developments  in  this  field 
of  research  are  reviewed. 
CHAPTER I


INTRODUCTION: THE BANDWIDTH COMPRESSION PROBLEM


In conventional telephony the sound wave produced by the speaker, after transformation
into electrical changes, is transmitted along the line to the receiver
where the electrical wave is re-converted into a sound wave similar to that produced
by the speaker. The spectrum of speech waves extends over a band roughly 10,000 c.p.s.
wide (21) and the intensity variations can be as high as 40 db.* The transmission of
high quality speech by conventional means requires a telephone line of the above bandwidth
and signal-to-noise ratio, although in normal telephony a 3500 c.p.s. band and a
dynamic range of about 30 db. has been found sufficient for the transmission of highly
intelligible, if not very natural-sounding, speech. The information-carrying capacity
of such lines is high. Shannon's (48) formula gives the channel capacity as

C = W log2 (1 + P/N)

where W = bandwidth of line and P/N = signal-to-noise ratio of line.

This  represents  about  133,000  bits  per  second  capacity  for  a  line  suitable  for 
high  quality  speech  transmission  and  35,000  bits  per  second  for  "telephone"  quality 
transmission. Quite simple considerations, given below, show that, at least theoretically,
a considerable reduction in channel capacity requirements should be possible
without  influencing  the  intelligibility  of  the  transmitted  speech.  If  the  channel 
capacity  required  by  one  conversation  could  be  reduced,  then  by  using  a  suitable 
coding  procedure  several  conversations  could  be  transmitted  simultaneously  over  the 
same  line  that  previously  carried  only  one  conversation.  The  possible  transmission 
economies  resulting  from  such  a  system  are  the  reason  for  the  interest  in  finding 
ways  of  reducing  the  channel  capacity  required  for  speech  transmission.  Systems 
making  use  of  such  principles  are  called  bandwidth  economy  speech  transmission  systems 
and since, as will be seen later, the necessary processes involve the selective transmission
of only certain characteristics of the original wave and the reconstruction
of a sound wave from the transmitted properties, such systems are also referred to as
analysis-synthesis  telephone  systems. 
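The two capacity figures quoted above can be checked with a short calculation. The sketch below assumes that the logarithm in Shannon's formula is taken to base 2 (so that the result is in bits per second) and that the decibel figures are power ratios; the function name is introduced here for illustration only.

```python
import math


def channel_capacity(bandwidth_cps: float, snr_db: float) -> float:
    """Shannon capacity C = W log2(1 + P/N), in bits per second."""
    power_ratio = 10 ** (snr_db / 10)  # convert decibels to a power ratio P/N
    return bandwidth_cps * math.log2(1 + power_ratio)


# Line suitable for high quality speech: 10,000 c.p.s. and 40 db.
high_quality = channel_capacity(10_000, 40)
# "Telephone" quality line: 3500 c.p.s. and 30 db.
telephone = channel_capacity(3_500, 30)

print(round(high_quality, -3), round(telephone, -3))
```

Rounded to the nearest thousand, the two values agree with the 133,000 and 35,000 bits per second given in the text.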

It  has  been  stated  above  that  telephone  lines  with  a  10,000  c.p.s.  bandwidth  and 
a 40 db. dynamic range are being used for the transmission of high quality speech.
Such lines have the high information-carrying capacity already mentioned because they
are  capable  of  transmitting  distinguishably  any  one  of  the  very  large  number  of 
different**  sound  waves  that  fall  within  this  frequency  and  intensity  range.  The 
variety  of  such  sound  waves  is  very  much  greater  than  the  number  relevant  for  speech 
transmission  as  there  is  a  definite  upper  limit  to  the  variety  of  sounds  that  the 
human  vocal  organs  can  produce.  The  variety  of  possible  speech  sound  waves  is 
limited  by  the  well-known  fact  that  their  generation  is  controlled  by  the  movement  of 
a  small  number  of  organs,  the  lips,  tongue,  teeth,  soft  palate  and  vocal  cords,  all 
of  which  are  capable  of  relatively  slow  movements  only.  As  the  number  of  different 


* H. Fletcher (21) states that the range of intensities of speech sounds in normal
conversational speech is about 30 db. and an additional over-all intensity variation
of 10 db. then makes the dynamic range of normal speech about 40 db.

** "Different" means the distinguishable steps permitted by the noise level.
[...] further reduction in the theoretical requirements for channel capacity results from
[the fact that] the vocal organs change only gradually: as a result, successive speech
[sounds] are often highly correlated and such a set of signals represents [...] possible
messages followed each other randomly. Further [...] assuming that the transmission of
speech does not require the full channel capacity of a line with a 10,000 c.p.s. bandwidth
is given [...] the human hearing mechanism could probably not [...] information
received [...] which a telephone line is [...].

The channel capacity required [for the] transmission of speech-like sound patterns can be
[...] that the information-carrying characteristics of speech waves are determined by
the movement of five independent vocal organs (vocal cords, lips, tongue, teeth, soft
palate), that information about their position is to be [transmitted to] a 3% accuracy
(or 30 db. signal-to-noise ratio) and that their [...] represents 50 significant changes
per second. The latter figure [...] as many as five positions of the vocal organs need be
[...] the phonemes produced at the relatively high [...] numerical values [...] of each
vocal organ [...] is

log2 1000 ≈ 10 bits

[...]

50 x 10 = 500 bits per second.

This gives a total channel capacity requirement of 2,500 bits per second for the transmission
[...] speech, with no [...]
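The arithmetic of this articulatory estimate can be retraced as follows. This is a minimal sketch under the assumptions stated in the text (five independent vocal organs, 50 significant changes per second for each, and a 30 db. signal-to-noise ratio read as a power ratio of 1000); the variable names are introduced here for illustration.

```python
import math

ORGANS = 5                 # vocal cords, lips, tongue, teeth, soft palate
CHANGES_PER_SECOND = 50    # significant changes per second for each organ
SNR_POWER_RATIO = 1000     # 30 db. expressed as a power ratio

# Information per significant change: log2 1000, which is close to 10 bits.
bits_per_change = math.log2(SNR_POWER_RATIO)

# The text rounds this to 10 bits before multiplying.
bits_per_organ = CHANGES_PER_SECOND * round(bits_per_change)
total_bits_per_second = ORGANS * bits_per_organ

print(round(bits_per_change, 2), bits_per_organ, total_bits_per_second)
```

The result, 500 bits per second for each organ and 2,500 bits per second in all, is the figure compared later with the 35,000 to 133,000 bits per second of a conventional line.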

The above figures probably under-estimate the requirements of the vocal cord
channel, but over-estimate those relating to the other vocal organs. The channel
capacity requirements would be greater if information about individual voice characteristics
were to be transmitted as well. The transmission of such information would
[...] require merely a closer specification of the action of the vocal cords and
[would not] necessarily increase the channel capacity requirements to any large extent.

[...] 2,500 bits per second is a [...] reduction [...] the 133,000 bits per second
for high quality transmission and the 35,000 bits per second [...] telephone line is
restricted to the transmission of speech-like sounds and such a line would not be able
to deal with any other type of sound such as [...] without severe distortion. The
realisation of such a telephone system requires the automatic recognition of the
acoustic correlates of all distinguishable articulatory changes - that is to say of the
significant characteristics of the speech sound wave - and their conversion into a code
suitable for the full exploitation of the channel capacity of the line; at the receiving
end the code would be re-converted into a speech sound wave. Neither the relevant
acoustic characteristics nor [...] their automatic extraction are fully known yet, but
work in this direction is being actively pursued with a view to making possible this
type of bandwidth economy speech transmission.




Further bandwidth economies should theoretically be possible because of the linguistic
origin of the speech sound waves. When communicating, the speaker first
organises the message to be transmitted in linguistic form, in terms of the phoneme
sequence and also stress, intonation, etc. These linguistic units are translated
successively into other forms and eventually into a speech sound wave which is the
acoustic form of the original linguistic code. The sound wave, on reaching the
listener, stimulates his hearing mechanism and the acoustic code of the transmitted
message is converted into a neural one. This is eventually re-converted into the
linguistic code in the listener's brain. It is this linguistic code which is then
interpreted as meaningful information. During the initial and final stages of this
transmission sequence the message is encoded in phonemic form and no greater channel
capacity should be necessary for the transmission of speech than is needed for the
transmission of the phonemes. Taking the number of phonemes in English as 40 and
assuming that they occur at the rate of 10 per second then the channel capacity required
for transmitting the phonemic information is

10 log2 40 = 53 bits per second.

The  phonemic  sequence  does  not,  however,  contain  as  much  speech  information  as 
the  speech  sound  wave.  Information  about  the  identity  of  the  speaker,  his  emotional 
state,  his  attitude  to  the  subject-matter,  such  things  as  emphasis,  doubt,  assertion, 
etc.  are  transmitted  by  the  speech  sound  wave,  but  not  on  the  whole  by  the  phonemic 
sequence.  It  is  possible  to  make  a  very  approximate  estimate  of  the  channel  capacity 
required  for  the  transmission  of  these  features  of  speech,  again  based  entirely  on 
theoretical  views  of  the  structure  of  English,  the  variety  and  speed  of  variation  of 
the  units  used  by  the  language  structure  for  formulating  these  aspects  of  information, 
rather  than  on  our  present  state  of  knowledge  about  which  of  these  units  can  in  fact 
be  recognised,  automatically  or  otherwise.  Intonation,  for  example,  probably  uses 
less than 10 units that follow each other not faster than 2 or 3 times per second and
therefore a channel capacity of 10 bits per second should be sufficient for its transmission.
After making similar estimates for stress and rhythm, it seems reasonable to
assume  that  a  channel  capacity  of  not  more  than  100  to  150  bits  per  second  should  be 
sufficient  to  transmit  information  about  the  sequence  of  the  various  structural  units 
of  English  as  they  occur  in  normal  speech.  This  is  considerably  less  than  the  2500 
bits  per  second  estimated  for  transmitting  information  about  the  articulatory  movements 
of speech or the 35,000 to 133,000 bits per second for transmitting the speech sound
wave  in  toto. 
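The linguistic estimates above all follow one pattern: a set of equally likely units, occurring in random order at a given rate, needs rate x log2(number of units) bits per second. The sketch below applies this to the two cases for which the text gives figures; the function name is an assumption made for illustration, and the intonation case takes the upper values quoted (10 units, 3 per second).

```python
import math


def information_rate(number_of_units: int, units_per_second: float) -> float:
    """Bits per second for equally likely units occurring in random order."""
    return units_per_second * math.log2(number_of_units)


phoneme_rate = information_rate(40, 10)     # 40 phonemes at 10 per second
intonation_rate = information_rate(10, 3)   # under 10 units, at most 3 per second

print(round(phoneme_rate), round(intonation_rate))
```

These come to about 53 and about 10 bits per second respectively, well inside the 100 to 150 bits per second allowed for all the structural units together.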

It  may  be  of  interest  to  interrupt  the  main  argument  at  this  stage  to  point  out 
that  the  above  calculations  give  information  rates  for  speech  that  are  still  higher 
than  the  theoretical  minimum,  because  they  assume  that  successive  speech  units  can 
occur  in  random  order.  In  fact,  owing  to  a  variety  of  linguistic  rules,  extensive 
statistical laws affect the possible sequences of units. These sequential probabilities
can be exploited, at least theoretically, for a further reduction of the channel
capacity requirements for speech transmission. Similar arguments apply to the transmission
of articulatory information, because on the whole the sequence of articulatory
configurations is also statistically determined: certain sequences do not occur at
all  and  others  again  have  differing  probabilities  of  occurrence.  These  statistical 
relationships  of  successive  articulations  can  be  exploited  to  reduce  information 
rates.  One  way  of  doing  this  has  been  proposed  by  C.  P.  Smith  (49). 
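The effect of sequential probabilities on the information rate can be illustrated numerically. The sketch below compares the information per unit when units occur in random order with the information per unit when the preceding unit is known, computed as the joint entropy of the digrams minus the entropy of the first unit. The digram counts are invented for illustration and are not measurements from this work or from any real language sample.

```python
import math
from collections import defaultdict

# Hypothetical digram counts (first unit, second unit), invented for illustration.
digram_counts = {
    ("t", "a:"): 30, ("t", "s"): 5, ("t", "t"): 1,
    ("a:", "t"): 20, ("a:", "s"): 14, ("a:", "a:"): 0,
    ("s", "t"): 18, ("s", "a:"): 10, ("s", "s"): 2,
}


def entropy_bits(counts):
    """Entropy in bits of the distribution described by a collection of counts."""
    nonzero = [n for n in counts if n > 0]
    total = sum(nonzero)
    return -sum((n / total) * math.log2(n / total) for n in nonzero)


# Total count for each unit in first position.
first_counts = defaultdict(int)
for (first, _second), n in digram_counts.items():
    first_counts[first] += n

units = len(first_counts)
random_order_bits = math.log2(units)  # all units equally likely, any order
# Information in the second unit once the first is known: H(first, second) - H(first).
digram_bits = entropy_bits(digram_counts.values()) - entropy_bits(first_counts.values())

print(round(random_order_bits, 2), round(digram_bits, 2))
```

Whenever the digram frequencies are unequal, the second figure is smaller than the first, which is the reduction in channel capacity requirement referred to above.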

Returning now to the discussion of the linguistic organisation of speech information,
it seems that this information, when expressed in terms of linguistic units,
requires a much smaller channel capacity for transmission than at the articulatory or
the acoustic levels. If the linguistic nature of speech is to be utilised in order to
achieve bandwidth economy, then the terminal equipment at the sending end must sort
the  input  waves  into  categories  corresponding  to  linguistic  units,  rather  than  to 
articulatory  configurations.  The  information  about  the  linguistic  units  must  be 
coded  efficiently  and  transmitted;  at  the  receiving  end  a  synthesiser  is  needed 
which  is  capable  of  generating  a  sound  wave  which  will  make  a  listener  recognise  the 
same  linguistic  unit  that  the  code  applied  to  its  input  represented.  An  idea  of  the 
nature  of  such  a  system  can  be  gained  by  considering  what  happens  when  a  speaker 
dictates to a teleprinter operator. The teleprinter operator receives the sound
waves produced by the speaker and transforms the speech information from the acoustic
into its linguistic form - in other words, he understands what is being said to him.
The  linguistic  units,  for  instance  in  the  form  of  a  sequence  of  phonemes  or  letters, 
are  then  typed  on  the  teleprinter  by  the  operator,  producing  an  electrical  code  with 
a  one-to-one  correlation  with  the  letters  being  sent.  At  the  receiving  end  another 
person could read back aloud the message being typed out, thereby re-converting the
linguistic  (visual)  data  into  its  acoustic  form.  The  bandwidth  required  for  such  a 
transmission  system  is  relatively  small,  corresponding  to  the  small  variety  and  slow 
rate of change of the units, the letters, being transmitted. The all-important transformation
of information from the acoustic to the linguistic form and back again is
in this case carried out by a human being at either end of the transmission channel.
The  human  being  has  no  difficulty  in  carrying  out  these  transformations,  although  he 
is  not  able  to  formulate  the  rules  he  uses  for  this  process.  Machines  that  perform 
the  same  operations  as  the  above  human  being  are  called  automatic  speech  recognisers 
and speech synthesisers respectively. The rules for constructing satisfactory devices
of this kind are not known yet; the search for these rules is an important
part of experimental phonetics, and finding them is an essential pre-requisite for
the  design  of  this  type  of  bandwidth  economy  telephone  system. 

In  the  foregoing  discussion  it  has  been  pointed  out  that  two  basically  different 
types  of  constraint,  articulatory  and  linguistic,  can  be  utilised  in  the  design  of 
speech  transmission  systems  requiring  less  bandwidth  than  that  needed  for  transmitting 
the  speech  sound  wave  itself.  In  each  case  terminal  equipment  is  needed  which  sorts 
the  input  waves  into  a  number  of  classes  that  are  smaller  than  the  total  number  of 
different  waves  possible  within  the  speech  spectrum.  The  principle  of  this  sorting 
process  is  basically  different  in  the  two  methods.  One  attempts  to  derive  a  simplified 
description  of  the  speech  sound  wave  that  is  based  on  the  operation  of  the  articulatory 
mechanism  and  therefore  only  knowledge  of  articulatory-acoustic  correlations  is  needed 
and  no  linguistic  considerations  whatsoever  are  involved.  The  other  system  is  concerned 
solely with a description of the speech information in terms of that sequence of linguistic
units on which the production of the sound wave was based. The systems using
the first of these two principles are called vocoders. They have been extensively investigated
in the past and a short description of their operation can be found in the
next  chapter.  This  is  followed  by  information  about  automatic  speech  recognisers  and 
about  efforts  for  using  them  in  speech  transmission  systems  of  reduced  bandwidth  -  as 
far  as  they  have  been  described  in  the  literature.  These  latter  systems  are  neither 
numerous  nor  very  successful:  unfortunately  the  characteristics  of  the  sound  patterns 
associated  with  the  various  linguistic  units  are  variable  and  overlapping  and  it  is 
not  easy  therefore  to  recognise  them  by  machine.  The  question  of  how  far  a  particular 
principle,  associated  with  the  linguistic  nature  of  the  speech  input,  can  be  used  to 
improve the performance of automatic speech recognition is discussed next; the construction
of an automatic recogniser incorporating this principle is described, as
well as the results obtained in experiments for testing its performance and finally
possible future developments in this field of research are reviewed. Such work should
throw  some  light  on  the  operation  of  the  human  speech  recognition  process  and  at  the 
same  time  help  in  the  design  of  bandwidth  compression  telephone  systems.  Once  such  a 
recogniser  is  achieved  it  can  be  used  as  a  terminal  converter  at  the  sending  end  of  a 
speech transmission system in which the electrical signals transmitted correspond to
the  phoneme  sequence. 


CHAPTER  II 


SPEECH  TRANSMISSION  SYSTEMS  USING  ARTICULATORY  CONSTRAINTS:  VOCODERS 

All  vocoder  systems,  in  their  design,  take  into  consideration  the  basic  mechanism 
of  speech  sound  production,  as  put  forward  by  Homer  Dudley  (10).  In  speech  sound 
generation the energy of the air flow from the lungs is converted into audible, alternating
sound pressure by the action of one mechanism or another to be called the "sound
source". Principally one of two different sound sources is active. One of these is
provided  by  the  action  of  the  larynx  which  produces  a  train  of  pulses  where  the  pulse 
repetition  rate  and  to  some  extent  also  the  shape  of  the  pulses  is  variable.  The 
other  sound  source  is  the  hiss  produced  when  the  stream  of  air  from  the  lungs  is  forced 
through  a  narrowing  of  the  vocal  tract  and  also  when  a  stream  of  fast-flowing  air  hits 
an  obstacle  like  the  teeth.  The  spectrum  of  the  hissy  sound  is  random  in  character 
(31).  These  sound  sources  may  be  active  on  their  own  or  in  combination.  The  spectrum 
of the sound waves generated by these sound sources is modified by the acoustic impedance
of the vocal tract which in turn is determined by the articulatory configuration,
that is the position of lips, teeth, tongue and palate. These articulatory
organs  can  move  only  relatively  slowly  and  therefore  the  spectral  changes  take  place 
at  a  correspondingly  slow  rate. 


The  first  of  the  vocoders,  the  so-called  channel  vocoder  developed  by  Dudley  (9) 
(10),  utilises  some  of  these  features  to  obtain  a  more  economical  description  of  the 
speech wave. The basic principle of its operation will be seen from Fig. 1.

Fig. 1. Principle of operation of Dudley's channel vocoder.

At the sending end there is a "voice-hiss discriminator" which determines whether the sound
wave has been produced by the larynx or the hiss source. The bulk of the energy produced
by the larynx source is in the low frequency region of the acoustic spectrum,
whilst  the  hiss  sounds  are  on  the  whole  concentrated  in  the  higher  frequency  regions 
and the operation of the "voice-hiss discriminator" relies on this spectral cue.
Whenever laryngeal excitation is indicated the "fundamental frequency detector" provides
an additional voltage which is proportional to the pulse repetition (larynx)
frequency. Dudley used the conventional zero-crossing count to provide an indication
of  this  frequency.  Information  about  the  spectral  envelope,  and  therefore  about  the 
position  of  the  articulators,  is  provided  by  the  10  channel  filters.  These  divide 
the  100  to  3000  c.p.s.  band  into  adjacent  sections  and  the  rectified,  smoothed  output 
of the filters indicates the level of energy falling within that section of the spectrum.
Since the articulators move only slowly, the spectrum will change correspondingly
slowly; hence the filter-rectifiers can be followed by smoothing filters with
a  low  cut-off  frequency  and  the  outputs  will  still  indicate  all  significant  changes 
in  the  spectrum.  Information  about  the  source  function,  the  fundamental  frequency 
and  about  the  output  of  each  filter  is  transmitted  along  separate  channels,  each  25 
c.p.s.  wide.  At  the  receiving  end  the  information  is  used  to  synthesise  a  sound  wave 
similar  to  that  at  the  input.  A  pulse  generator  and  a  white  noise  generator  are 
available  as  alternative  sound  sources.  A  switch,  controlled  by  the  signal  from  the 
"voice-hiss discriminator" at the sending end, connects one or other of these source
voltages  to  a  bank  of  10  filters  which  are  similar  to  the  10  analysing  filters  at  the 
sending  end.  The  frequency  of  the  pulse  generator  is  made  the  same  as  the  fundamental 
frequency of the speech wave being transmitted by making use of the information transmitted
from the fundamental frequency detector. The synthesising filters divide the
energy  of  the  sound  source  into  10  spectral  bands.  Information  transmitted  along  the 
remaining  channels  is  used  to  control  the  output  level  of  each  synthesising  filter 
so  that  it  is  the  same  as  that  of  the  corresponding  analysing  filter.  Finally  the 
outputs  of  all  the  filters  are  added  and  applied  to  a  loudspeaker.  Dudley's  vocoder 
required a total bandwidth of about 300 c.p.s., offering a 10 : 1 bandwidth compression
compared with the normal telephone channel. This economy in bandwidth is
achieved  by  allowing  only  certain  kinds  of  source  function,  by  transmitting  only 
slow  variations  of  the  spectral  envelope  and  by  transmitting  spectral  data  averaged 
over  a  limited  number  of  frequency  intervals,  that  is,  by  making  use  of  constraints 
arising  out  of  the  nature  of  the  vocal  mechanism.  In  addition,  only  details  of  the 
amplitude  spectrum  but  not  of  the  phase  spectrum  are  transmitted;  in  this  way  a 
constraint  resulting  from  the  nature  of  the  hearing  mechanism  is  also  exploited 
(since  the  ear  does  not  make  use  of  phase  information). 
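The bandwidth figure for the channel vocoder follows from counting the separate transmission channels described above. In the sketch below the channel count and channel width are taken from the text; the 3000 c.p.s. used for the normal telephone channel is an assumption made here so that the comparison can be computed, chosen to be close to the 100 to 3000 c.p.s. band analysed by the filters.

```python
SPECTRUM_CHANNELS = 10      # one per analysing filter
CONTROL_CHANNELS = 2        # source function (voice or hiss) and fundamental frequency
CHANNEL_WIDTH_CPS = 25      # width of each transmission channel

# Assumed width of the normal telephone channel, for comparison only.
TELEPHONE_BANDWIDTH_CPS = 3000

vocoder_bandwidth = (SPECTRUM_CHANNELS + CONTROL_CHANNELS) * CHANNEL_WIDTH_CPS
compression_ratio = TELEPHONE_BANDWIDTH_CPS / vocoder_bandwidth

print(vocoder_bandwidth, compression_ratio)
```

Twelve channels of 25 c.p.s. give the 300 c.p.s. quoted, and on the assumed telephone bandwidth the compression is 10 : 1.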

On testing the speech transmission efficiency of the channel vocoder, Dudley
found  that  a  word  articulation  of  about  70%  could  be  achieved.  This  already  very 
good  performance  was  later  improved  as  a  result  of  extensive  research  carried  out 
at the Bell Telephone Laboratories and at the British Post Office (4) (29) (54).
The latest vocoders use 18 channels, with cut-off frequencies of about 20 c.p.s.,
requiring a total bandwidth of about 350 c.p.s., a 30 db. signal to noise ratio
and show a 90% articulation score when tested with PB words.*

* Phonetically balanced word lists designed by the Psycho-acoustic Laboratories,
Harvard University and published by J. Egan, Laryngoscope, 58, 955-991, 1948.

The naturalness of the transmitted speech, as distinct from its intelligibility,
was not very good. Naturalness is a sensory dimension which was at that time, and
for that matter still is, not well-defined and no generally accepted methods for
measuring it were, or are, available. The naturalness of the system was not considered
as good as that of conventional telephone systems and largely for this reason the vocoder
is  still  not  in  commercial  use.  A  considerable  amount  of  work  has,  of  course,  been  done 
to improve the naturalness of the original channel vocoder. Experimental evidence
[indicates] that major causes of this lack of naturalness are incorrect switching
of the buzz and hiss source generators and failure of the fundamental frequency of the
output  to  follow  the  variations  of  the  larynx  vibration  frequency  of  the  speaker.  In 
other words lack of definition of the excitation function rather than insufficient information
about spectral envelope is at fault and the remedy lies not so much with a
different design of channel filter as with an improvement of the buzz-hiss switching
and fundamental frequency detector circuits. Several attempts have been made (6) (28)
to improve the performance of the fundamental frequency detector circuits without producing
a really satisfactory solution. A number of factors make the measurement of the
fundamental frequency of the speech wave difficult. One of these is the presence of
strong harmonics, another is that the averaging process, which is part of most frequency
meter circuits, smooths out the cycle-by-cycle variations of fundamental frequency that
can occur in speech and that are sometimes significant. Recently an autocorrelation
method has been tried (27) to overcome these difficulties. The delay which produces the
greatest value of autocorrelation indicates the fundamental frequency and the actual
value of the autocorrelation for this delay is used to control the buzz-hiss switch.
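The autocorrelation principle just described can be sketched in a few lines. This is an illustrative digital version under assumptions of our own, not a description of the circuit in (27): the signal is taken to be sampled at 8000 samples per second, the search is limited to 60 to 400 c.p.s., the autocorrelation is normalised, and a value of 0.5 is chosen arbitrarily as the buzz-hiss threshold. The function name and both test signals are hypothetical.

```python
import math
import random


def estimate_pitch(samples, sample_rate, f_min=60.0, f_max=400.0, threshold=0.5):
    """Return (fundamental in c.p.s., voiced flag) from the autocorrelation peak."""
    min_lag = int(sample_rate / f_max)
    max_lag = int(sample_rate / f_min)
    best_lag, best_value = min_lag, -1.0
    for lag in range(min_lag, max_lag + 1):
        head = samples[:len(samples) - lag]
        tail = samples[lag:]
        energy = math.sqrt(sum(a * a for a in head) * sum(b * b for b in tail))
        # Normalised autocorrelation for this delay (between -1 and 1).
        value = sum(a * b for a, b in zip(head, tail)) / energy if energy else 0.0
        if value > best_value:
            best_lag, best_value = lag, value
    # The best delay gives the frequency; the peak value drives the buzz-hiss decision.
    return sample_rate / best_lag, best_value >= threshold


SAMPLE_RATE = 8000
# "Buzz": a 100 c.p.s. fundamental with two strong harmonics.
buzz = [
    sum(a * math.sin(2 * math.pi * 100 * h * n / SAMPLE_RATE)
        for h, a in ((1, 1.0), (2, 0.5), (3, 0.3)))
    for n in range(800)
]
# "Hiss": random noise from a fixed seed.
rng = random.Random(0)
hiss = [rng.uniform(-1.0, 1.0) for _ in range(800)]

buzz_frequency, buzz_voiced = estimate_pitch(buzz, SAMPLE_RATE)
hiss_frequency, hiss_voiced = estimate_pitch(hiss, SAMPLE_RATE)
print(buzz_frequency, buzz_voiced, hiss_voiced)
```

Because the delay is found on the whole waveform rather than by counting zero crossings, the strong harmonics in the buzz signal do not disturb the estimate, and the low peak value obtained for noise selects the hiss source.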


The so-called base band (47) vocoder represents yet another method for obtaining
the correct excitation function at the receiving end of the vocoder. As can be seen
from the schematic diagram in Fig. 2, a narrow section of the spectrum of the original
[speech wave] is transmitted separately. At the receiving end, suitable circuitry converts
this small part of the original speech spectrum into a wave having a uniform
[...] over the speech [...], but still [retaining] the periodicity or hissy character
of the original speech wave. This converted [wave is] used to excite the channel
filters. In this way the synthesised output will always have the same fundamental
frequency as the original speech input and no buzz-hiss switch is required because the
excitation function is part of the original speech [...]. [...] c.p.s. slice of the
original speech spectrum has to be transmitted for obtaining satisfactory results, and
therefore a bandwidth of about [...] required for the vocoder as a whole. This seems
rather a lot of bandwidth for transmitting details of the excitation function of speech
when compared with the 300 c.p.s. bandwidth required to transmit information about the
spectral envelope. [...] not much more than the theoretical minimum because it seems
that one of
the  factors  that  affect  the  naturalness  of  speech  is  the  correct  reproduction  of  the 
sometimes  quite  large  cycle-by-cycle  changes  in  the  fundamental  frequency  of  speech. 

The channel vocoder exploits only some of the constraints that arise out of the
nature of the speech-producing mechanism: the fact that only certain kinds of excitation
function are possible and that the spectral envelope can vary only at a relatively
slow rate. This will reduce considerably the number of different sound patterns
for transmission and therefore the channel capacity required, but the number is still
much larger than the variety that the vocal organs can actually generate. A good
illustration of this fact is given by David (4). He points out that the output of a
16-channel vocoder in which the amplitudes in each channel are quantised to eight discrete
levels can represent 8^16 = 2^48 different spectral envelopes; if the vocoder
produced as many as 32 different patterns per second it would still take more than
[...] years to produce all possible patterns. An estimate of the variety of sounds that
can be produced by the vocal organs and, at the same time, a guide for a method of
classifying sound waves into categories that represent all possible speech waves but
[no] others is provided by the results of research on the acoustics of the vocal organs.
Numerous  experiments  and  computations  (15)  (38)  have  shown  that,  in  the  majority  of 
cases  at  any  rate  and  in  particular  for  all  vowel-like  sounds,  the  spectral  envelope 

of  S*  “C°UStl^  output  of  the  vocal  organs  can  be  specified  in  terms  of  the  frequencies 
of  the  first  three  peaks  of  the  spectrum.  These  spectral  peaks  correspond  to  the 
resonances  of  the  vocal  tract  impedance  and  are  called  the  formants.  It  has  also  been 
hown  (14)  that,  as  long  as  the  formant  frequencies  are  specified,  no  further  informa¬ 
tion  is  required  for  determining  the  relative  anplitudes  of  these  formants.  It  is 
a  so  interesting  to  note  at  this  stage  that  just  as  on  the  acoustic  level  the  spectral 
envelope  can  be  specified  by  stating  the  values  of  only  three  separate  variables,  the 
three  formant  frequencies,  so  on  the  articulatory  level  the  configuration  of  the  vocal 
tract  can  be  specified  by  defining  three  variables,  namely  the  distance  from  the 
glottis  to  the  greatest  constriction  along  the  vocal  tract,  the  size  of  this  constric- 
tion  and  the  configuration  of  the  lip  opening  (13)  (14)  (51).  Once  it  has  been  estab- 
iished  that  the  spectral  envelop,  of  speech  waves  can  be  specified  just  by  the  values 
o  the  three  formant  frequencies,  then  the  possible  number  of  significantly  different 
spectral  patterns  can  be  calculated  as  long  as  it  is  known  to  what  accuracy  the  formant 
frequencies  need  be  known.  A  good  guide  to  the  accuracy  required  is  the  ability  of 
the  listener  to  detect  changes  of  formant  frequency  and  this  has  been  determined  by 
experiments  in  which  speechlike  sounds  with  a  variety  of  values  of  formant  frequency 
were  produced  using  a  terminal  analogue*  synthesiser  (16).  It  was  found  that  the 
threshol^of  discriminetion  was  about  ±3%.  This  would  mean  that  only  333  or  approxi¬ 
mately  2  different  spectral  envelopes  can  be  distinguished  when  dealing  with  vowel- 
like  sounds  as  compared  with  the  variety  of  248  patterns  that  can  be  specified  by  the 
output  of  a  channel  vocoder.  %>ecification  of  the  spectral  envelope,  of  course,  does 
not provide enough information for reconstruction of the sound wave itself: the
[illegible] also determines the characteristics of the source function [illegible]
fundamental frequency, if any, as well as [illegible]
(16) [illegible] for formant frequency [illegible] amplitude [illegible]
[illegible] the original speech wave can be generated using a suitable synthesiser [illegible]

usually a terminal analogue of the human vocal tract. There is a pulse generator
and a hiss generator to provide alternative source functions, and they are connected
to the resonant circuits by the buzz-hiss switch. Just as in the channel
vocoder the value of the pulse frequency and the state of the buzz-hiss switch
are controlled by information transmitted from the sending end. The resonant frequencies
of the tuned circuits are variable and each is varied, using the information
transmitted from the sending end, to correspond with one of the three formant frequencies
of the speech input. In this way the peaks of the spectrum of the output will
always be at the same frequencies as the formants of the speech input. The three resonant
circuits can be connected either in series or in parallel. If they are connected in
series, the relative amplitudes of the formant peaks are automatically adjusted as the
signal passes from one resonant circuit to the other (14), and information about overall
amplitude only is required. In the parallel connection the resonant circuits do
not interact and therefore separate information is required about the amplitude
of each formant. Therefore fewer transmission channels are required with the
series connection; at the same time an error in transmitting the value of a formant
frequency is more serious in a series system than in a parallel one because in the former
system the transmitted values of the formant frequencies also control the amplitudes
of the formant peaks in the output.
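The counting argument attributed to David (4) earlier in this paragraph, and the comparison with the formant description, can be checked with a few lines of arithmetic. The sketch below is only an illustration: the length of a year in seconds and the figure of 33 distinguishable steps per formant (as read from the text) are assumptions, and all variable names are hypothetical.

```python
import math

# Channel vocoder: 16 channels, each quantised to 8 amplitude levels.
channels = 16
levels = 8
channel_vocoder_patterns = levels ** channels      # 8^16, which equals 2^48

# Time needed to run through every pattern at 32 patterns per second.
patterns_per_second = 32
seconds_per_year = 365.25 * 24 * 3600              # assumption: Julian year
years_to_exhaust = channel_vocoder_patterns / patterns_per_second / seconds_per_year

# Formant description: three formants, 33 distinguishable steps each
# (the step count is taken from the text, not derived here).
steps_per_formant = 33
formant_patterns = steps_per_formant ** 3
formant_bits = math.log2(formant_patterns)         # close to 15

print(channel_vocoder_patterns == 2 ** 48)
print(round(years_to_exhaust))
print(formant_patterns, round(formant_bits, 2))
```

Under these assumptions the exhaustive run takes of the order of a few hundred thousand years, while the formant description needs only about 15 bits per spectral envelope against 48 for the channel vocoder.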

A schematic diagram showing the principal constituents of such speech transmission
systems, usually called resonance or formant vocoders, is shown in Fig. 3.
Several resonance vocoder systems have been tried (1) (18) (32) (50); they all use
the same basic principles although the circuitry which implements these principles
differs. At the sending end the electronics for extracting information
about the source function is similar to that used in the channel vocoder. The
formant frequencies are determined by first dividing the spectrum into three bands
in which the formants are expected. The value of the spectral peak in each
band is found either by sub-dividing by further filtering and
finding the filter with the maximum output or by a method based on finding the
average value of the rate of zero-crossings. At the receiving end Miller capacities,
variable inductances (increductors) and many other devices have been used to obtain
variable tuning for the resonant circuits. A typical resonance vocoder circuit
gives 70% to 80% articulation with PB words and needs a bandwidth of 100 to 150 c.p.s.
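The zero-crossing method mentioned above rests on the fact that a band dominated by a single spectral peak crosses zero about twice per cycle of that peak. The following is a minimal digital sketch of the idea, not a description of any of the analogue circuits cited; the sample rate, the test frequency and the function name are assumptions chosen for illustration.

```python
import math

def zero_crossing_frequency(samples, sample_rate):
    """Estimate the dominant frequency from the average zero-crossing rate."""
    # Count sign changes between successive samples.
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0)
    )
    duration = (len(samples) - 1) / sample_rate
    # Two zero crossings per cycle of a sinusoid.
    return crossings / (2.0 * duration)

sample_rate = 8000          # samples per second (assumed)
formant_hz = 700.0          # hypothetical formant frequency
samples = [
    math.sin(2 * math.pi * formant_hz * n / sample_rate + 0.3)
    for n in range(sample_rate)   # one second of signal
]
estimate = zero_crossing_frequency(samples, sample_rate)
print(round(estimate))
```

With real speech the band would first have to be isolated by filtering, since the estimate is only meaningful when one spectral peak dominates the band.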

Although  these  figures  show  that  resonance  vocoders  transmit  speech  with  good 
intelligibility  they  are  afflicted  with  the  same  lack  of  naturalness  as  the  channel 
vocoder.  This  is  not  surprising  as  they  use  the  same  circuits  for  dealing  with  the 
excitation  function  as  the  channel  vocoder.  Consequently  the  base  band  principle, 
which gave good results with the channel vocoder, has also been tried with the resonance
vocoder (20). The schematic diagram of such a device is shown in Fig. 4.

It  requires  a  bandwidth  of  about  550  c.p.s.  and  combines  80%  intelligibility  for  PB 
words  with  naturalness  that  is  noticeably  better  than  that  of  the  conventional 
channel  vocoder  and  at  the  same  time  still  provides  a  bandwidth  compression  of 
about  5  or  6  to  1. 

It may be of interest to diverge at this point and explain that although the
resonance vocoder seems to be able to transmit most speech sounds with a reasonable
degree of intelligibility, it is still primarily a system suitable for the transmission
of vowel-like sounds. This is because at the sending end the sound waves
to be transmitted are categorised in terms of the first three spectral peaks and
similarly at the receiving end a terminal analogue synthesiser is used whose output
is characterised by the corresponding three spectral maxima. It is only the vowel-like
sounds, in other words sounds generated with the sound source at the far end
of the vocal tract and with the vocal tract consisting of a single tube without
side branches, that can be described in terms of spectral maxima only (15). As soon
as the sound source moves further forward or when side branches of the vocal tract
come into play, as is the case for the generation of fricatives and nasals, the
spectrum of the sound wave generated is characterised by minima as well as by maxima.
No adequate methods of analysis are available as yet which would furnish the characteristics
of these spectra. Also, the terminal analogue synthesiser in the form in which
it is being used in existing resonance vocoders could not generate the corresponding
sounds and a vocal tract analogue type synthesiser would be needed.

If  a  vocal  tract  analogue  is  to  be  used  then  it  may  be  more  convenient  to  control 
its  adjustment  in  articulatory  terms  rather  than  acoustic  ones,  that  is  by  specifying 
the  tongue  and  lip  positions,  etc.,  to  which  it  is  to  be  adjusted  rather  than  the 
spectrum  of  the  resulting  sound  wave.  This  is  only  possible  if  the  acoustic  information 
about  the  speech  sound  wave  can  first  be  interpreted  in  articulatory  terms  and  vice 
versa.  Published  information  for  performing  such  a  conversion  is  as  yet  available  for 
certain  classes  of  sound  only  (51)  (52)  and  transmission  systems  of  this  kind  have  not 
been  tried. 

One more vocoder has to be described, the pattern vocoder (49), which is only
now being developed. It attempts to reduce the channel capacity required for speech
transmission by ensuring that only those spectral patterns that actually do occur in
speech can be transmitted: those spectral patterns that in preliminary experiments
have been found to contribute to speech intelligibility are stored for use in
the transmission system. The speech input is first of all applied to a bank of
filters, just as in the channel vocoder. The spectral envelope represented by the
output of these filters is sampled 50 times per second and the samples are then compared
with each of the stored patterns. The identity of whichever stored pattern
matches best with the spectrum of the input is indicated and a corresponding serial
number is transmitted. At the receiving end this serial number is used to select a
set of stored control instructions which adjust a synthesiser so as to produce a
sound wave with the corresponding spectral envelope. This process is repeated for
each sample of the input spectrum. The analyser at the sending end also provides
the usual information about the sound source (buzz-hiss distinction, value of the
fundamental frequency and amplitude) and this information is transmitted and used
at the receiving end to control the operation of the synthesiser. It is hoped that
not more than 4,000 different spectral patterns will have to be stored to make
possible the transmission of intelligible speech and that the input spectrum would
have to be sampled not more than about 50 times per second; this represents an information
rate of 600 bits per second. The system as described so far can transmit
any sequence of these stored patterns and does not exploit the fact that in speech
these spectral patterns do not occur in random sequences. Further bandwidth economy
may be possible if in addition to the store of spectral patterns already described
the sequences of spectral patterns that have been observed in speech are also stored.
The patterns recognised are not immediately transmitted but remembered and the
observed pattern sequences are compared with the stored sequences; the identity of
the stored sequence that agrees best with the input sequence is indicated and transmitted.
At the receiving end this information is used to select one of a set of
stored control instructions which will adjust the synthesiser to reproduce the
sound sequence that occurred at the sending end. Experiments have yet to be carried out to
find how many spectral patterns, how many patterns per pattern sequence and how many
pattern sequences are needed for satisfactory speech transmission. The saving in
bandwidth cannot be assessed until these figures are known, but such a system is
bound to be more economical in bandwidth because of the inevitably large number of
sequences that do not occur in speech which other types of vocoder are capable of
transmitting and the pattern vocoder is not.
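The two central operations of the pattern vocoder, choosing the best-matching stored pattern and sending only its serial number, can be sketched as follows. This is an illustrative sketch only: the text does not say how "best match" is measured, so a least-squared-error criterion is assumed here, and the three stored patterns are invented toy data rather than anything from the cited work.

```python
import math

def best_pattern_index(spectrum, stored_patterns):
    """Serial number of the stored pattern closest to the input spectrum.

    Closeness is measured by squared error, which is an assumption.
    """
    def squared_error(pattern):
        return sum((s - p) ** 2 for s, p in zip(spectrum, pattern))
    return min(range(len(stored_patterns)),
               key=lambda i: squared_error(stored_patterns[i]))

# Hypothetical store of three coarse spectral envelopes (three filter outputs each).
stored_patterns = [
    [1.0, 0.2, 0.1],
    [0.1, 1.0, 0.3],
    [0.2, 0.3, 1.0],
]
serial = best_pattern_index([0.15, 0.9, 0.25], stored_patterns)

# Information rate for the figures quoted in the text:
# 4,000 stored patterns, sampled 50 times per second.
bits_per_pattern = math.ceil(math.log2(4000))   # whole bits needed per serial number
bit_rate = bits_per_pattern * 50
print(serial, bits_per_pattern, bit_rate)
```

Twelve bits are enough to number 4,000 patterns, and 12 bits sent 50 times per second gives the 600 bits per second quoted above.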

The preceding paragraphs summarise the operation of the principal vocoder
systems that have been or are being tried. The principle by which they achieve bandwidth
economy is shared by them all: using our knowledge of the operation of the
vocal organs and of the human perceptive mechanism and without losing any precision
in specifying those characteristics that are relevant to intelligibility, the sound
input is classified into a smaller number of categories and significant variations
are restricted to a slower rate than would be possible for the bandwidth and signal-
to-noise ratio of the original speech wave. This simplified description of the
input is transmitted and then used to control a synthesiser in such a way that a
sound wave similar to that at the input is generated. Our knowledge is not yet extensive
enough to obtain a specification of articulatory action from the acoustic
wave. Work is proceeding in this direction and results may perhaps be useful in
achieving further bandwidth compression.


SPEECH  TRANSMISSION  SYSTEMS  USING  LINGUISTIC  PRINCIPLES  TO  ACHIEVE  BANDWIDTH
ECONOMY 

As was pointed out earlier, much greater economies in channel capacity could
be achieved if the acoustic input was classified in terms of linguistic categories.
As a rule quite a number of acoustic patterns (or of articulatory configurations)
or even sequences of these correspond to one linguistic unit. Therefore the total
number of possible linguistic units is smaller, or they vary at a lower rate or
both, and these linguistic units can be transmitted over lines with smaller channel
capacity. Unfortunately the rules for classifying the acoustic patterns into categories
corresponding to linguistic units are not yet known and neither are the rules
for the reverse process, for controlling a speech synthesiser from a phonemic input.
One attempt has recently been made to summarise our knowledge of phonemic synthesis
(42) but the construction of a practical synthesiser based on these rules has not
yet been tried. As far as linguistic recognisers are concerned, several attempts
have been made to classify speech wave inputs into various linguistic categories
such as phonemes or words. Any device which classifies the speech wave input into
categories corresponding to linguistic units and then indicates the result by producing
a coded signal which has a one-to-one relationship with the linguistic unit
is called an automatic speech recogniser; depending on the linguistic unit which
forms the basis of this operation there are automatic phoneme recognisers, automatic
word recognisers, etc.

The earliest attempt to construct a phoneme recogniser was probably that of
Dreyfus-Graf (7) (8) whose method is based entirely on a rough analysis of the
spectral envelope of the speech sound wave. The input is applied to band pass
filters which divide the spectrum into six adjacent bands and cover the total range
of 80 c.p.s. to 3800 c.p.s. The output of these filters, after rectification and
smoothing, is used to control the deflection of a pen recorder. The pen can move
in any one of six directions, at an angle of 60° to each other and all in the
horizontal plane, and the amplitude and direction of the deflection correspond to
the resultant of the outputs of the six band pass filters; the pen writes on a
sheet of paper that is moved past the pen at a uniform rate. In response to a
speech input, the pen will draw figures which should be characteristic of the spectral
pattern which gave rise to them. Using one speaker only and for sounds produced
in isolation or words spoken slowly, recognisably different patterns were
drawn by the pen for different speech sounds. In some versions of his machine,
Dreyfus-Graf replaced the pen recorder by a number of differential relays which,
depending on the configuration of output of the six filters, operated one of a set
of contacts. These in turn operated the keys of a typewriter.
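The pen deflection described above is the vector resultant of six filter outputs acting along directions 60° apart. A minimal sketch of that resultant follows; the assignment of band k to the direction 60k degrees is an assumption made for illustration, since the text does not state which band drives which direction, and the function name is hypothetical.

```python
import math

def pen_deflection(filter_outputs):
    """Return (amplitude, direction in degrees) of the resultant deflection.

    Band k is assumed to pull the pen along the direction 60*k degrees.
    """
    if len(filter_outputs) != 6:
        raise ValueError("expected the outputs of six band pass filters")
    x = sum(a * math.cos(math.radians(60 * k)) for k, a in enumerate(filter_outputs))
    y = sum(a * math.sin(math.radians(60 * k)) for k, a in enumerate(filter_outputs))
    return math.hypot(x, y), math.degrees(math.atan2(y, x)) % 360.0

# A flat spectrum leaves the pen at the centre; energy in one band deflects it
# along that band's direction.
flat_amplitude, _ = pen_deflection([1, 1, 1, 1, 1, 1])
single_amplitude, single_direction = pen_deflection([0, 2, 0, 0, 0, 0])
print(flat_amplitude, single_amplitude, single_direction)
```

The sketch also makes one limitation visible: quite different spectra can give the same resultant, because six outputs are reduced to a single point in the plane.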

To ensure that a separate figure is drawn for successive phonemes, it is
necessary to return the pen to the central position whenever a new phoneme starts.
This raises the difficult problem of deciding when one phoneme ends and the next
starts: the so-called "gating" problem. "Gating" is a problem common to all
phoneme recognisers and arises from the fact that the speech sound wave, a continuous
event, has to be analysed in terms of a sequence of linguistic units which
are by definition discrete entities, one following the other. Dreyfus-Graf first
operated his phoneme-gate after a fixed interval of about 1/20 sec. and this obviously
made the shape of the output patterns very dependent on the rate of speaking.
Later he used the rate of change of averaged speech intensity variations for
gating: a fresh phoneme recognition was started whenever the overall intensity,
averaged over a short time interval, showed a fast change. Some versions of
Dreyfus-Graf's system used this wave envelope also as additional information for
the recognition of consonants. The intensity time function was applied to a bank
of filters dividing the 2 c.p.s. to 64 c.p.s. frequency range into four adjacent
bands. Dreyfus-Graf called the prominent components of the resulting spectrum
"sub-formants" and used them extensively in recognition.
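The second gating rule described above, starting a fresh recognition whenever the short-time averaged intensity changes quickly, can be sketched in a few lines. This is an illustration under stated assumptions, not Dreyfus-Graf's circuit: the averaging window, the threshold and the rule that suppresses repeated gates during one transition are all hypothetical choices.

```python
def gate_points(intensity, window=3, threshold=0.5):
    """Frame indices at which the smoothed intensity shows a fast change."""
    # Short-time average over the last `window` frames.
    smoothed = []
    for i in range(len(intensity)):
        chunk = intensity[max(0, i - window + 1):i + 1]
        smoothed.append(sum(chunk) / len(chunk))

    gates = []
    for i in range(1, len(smoothed)):
        fast_change = abs(smoothed[i] - smoothed[i - 1]) > threshold
        # Ignore further changes that belong to the same transition.
        clear_of_last_gate = not gates or i - gates[-1] > window
        if fast_change and clear_of_last_gate:
            gates.append(i)
    return gates

# Toy intensity contour: silence, a loud segment, then a weak segment.
contour = [0.0] * 10 + [3.0] * 10 + [0.2] * 10
gates = gate_points(contour)
print(gates)
```

Unlike a fixed 1/20 sec. gate, the gate positions here follow the signal, so they do not depend directly on the rate of speaking, although they do depend on the chosen threshold.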

Basically, then, Dreyfus-Graf's method produces a specialised visual representation
of the acoustic spectrum and is based on the assumption of a one-to-one
relationship between spectral patterns and phonemes. It performs a true recognition,
that is, classification, process only in so far as it converts the wave continuum
into discrete segments. Within each time segment, however, the visual
representation of the acoustic spectrum is along a continuous scale and therefore
no classification is in fact carried out by the "recogniser"; instead the
classification must be performed by the observer of the written patterns. In
this sense therefore this is not really a phoneme recogniser at all but is more
like a vocoder. In a vocoder the spectral analysis is also performed on a continuous
scale, but the process is reversible and the output of the analyser can be
used conveniently for the synthesis of a corresponding sound wave which is then
presented to a listener; it is the listener who then classifies these sound patterns
into the corresponding phonemic categories by using the normal method of speech
recognition at his disposal. Similarly in Dreyfus-Graf's system it is the observer
rather than the automatic recogniser (despite the fact that the machine is called a
phoneme recogniser) that carries out the classification; the observer finds this a
much more difficult task than the classification of acoustic patterns because the
framework of normal speech recognition already acquired cannot be used.

On the other hand, true classification of acoustic patterns into groups corresponding
to linguistic units is carried out by the automatic phoneme recogniser
designed by Wiren and Stubbs (55). In their method, the speech input was examined
for the presence or absence of acoustic properties thought to be characteristic of
certain linguistic classes and the results were used in a succession of binary
selections to reach a final classification into phonemic groups. This approach
was based on the idea of distinctive features (34). The distinctive features, as
proposed by Jakobson, Fant and Halle, are a set of linguistic attributes and the
listener identifies the phonemes by a series of binary decisions based on the
presence or absence of some or all of these features. Jakobson, Fant and Halle
described some acoustic correlates of these linguistic features and Wiren and Stubbs
have based the operation of their phoneme recogniser on the detection of these and
other acoustic correlates and on using them for phoneme identification in a succession
of binary decisions similar to that suggested by the distinctive feature approach.

The distinctive oppositions on the linguistic side for which acoustic correlates
were sought were voiced/unvoiced, stop/fricative, non-turbulent/voiced turbulent,
vowel/vowel-like consonant and acute/grave. Most of the acoustic distinctions depended
on the spectral distribution of energy but amplitude and rate of rise of amplitude
were also used as cues. Different parts of the system were tested with up
to several hundred utterances of anything from 4 to 20 different speakers, giving
success scores that varied from 50% to almost 100%. The results are a reflection
of the fact that the distinctive features and their acoustic correlates are far
from being related in a one-to-one manner.
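A succession of binary selections of the kind described above can be pictured as a small decision tree. The sketch below uses the oppositions named in the text, but the order of the decisions and the group labels are assumptions made purely for illustration; they are not the tree actually used by Wiren and Stubbs, and the boolean cues stand in for acoustic detectors that are not modelled here.

```python
def classify(cues):
    """Classify a segment into a phonemic group by successive binary decisions.

    `cues` maps hypothetical detector names to True or False.
    """
    if not cues["voiced"]:
        # voiced/unvoiced decided first, then stop/fricative
        return "unvoiced stop" if cues["stop"] else "unvoiced fricative"
    if cues["turbulent"]:
        # non-turbulent/voiced turbulent
        return "voiced turbulent"
    if cues["vowel"]:
        # vowel/vowel-like consonant, then acute/grave
        return "acute vowel" if cues["acute"] else "grave vowel"
    return "vowel-like consonant"

example = classify({"voiced": True, "turbulent": False,
                    "vowel": True, "acute": True, "stop": False})
print(example)
```

Because every decision is binary, a single wrong cue early in the chain sends the segment to the wrong group, which is one way of seeing why imperfect acoustic correlates limit the success scores.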

Another system of automatic speech recognition is described in two publications
from the Bell Telephone Laboratories (11) (12) in which two versions of
basically similar equipment are described. One of them is a spoken digit recogniser
which is concerned with the recognition of the ten numbers 0 to 9, each
pronounced in isolation. In the first stage of the recognition process, the
speech wave is applied to a bank of 10 filters, similar to those used in Dudley’s
channel vocoder. In a series of preliminary experiments the average spectrum,
as represented at the output of the 10 filters, was obtained for 10 different
sustained speech sounds, some of them vowels and others consonants, all of them
having been found significant in the recognition of the spoken digits. During the
recognition process the spectral envelope of the speech input is matched separately
with each of the reference patterns obtained from the preliminary experiments.
The identity of whichever stored pattern matches the input best is indicated by
the operation of the appropriate one of ten relays provided. Whenever during the
utterance of the digit the spectrum of the input changes sufficiently for the
best match to have shifted to another of the stored patterns, then the previously
energised relay releases and the appropriate fresh relay operates. Further preliminary
experiments are needed for the last stage of the recognition process:
the average duration for which each of the 10 relays is operated during the utterance
of each of the 10 digits is determined. The actual duration for which each
of the relays is operated whilst the speech input to be recognised is articulated
is compared with each of the duration patterns obtained in the preliminary experiments
and the best match is indicated and provides the final choice. This recognition process
corresponds in effect to the selection of the best match obtained by comparing
the frequency-amplitude-time spectrogram of the input with each of 10 reference
spectrograms, one for each of the 10 digits as obtained in the preliminary experiments.
The time dimension of these spectrograms takes into consideration only the duration of
each spectral element and not the order in which they occur. The machine recognises
correctly about 97% of the numbers spoken into it by one speaker and its performance
deteriorates when several voices are used. This then is a true automatic word recogniser
in which the machine categorises the acoustic input into 10 classes corresponding
to linguistic units: words. Also, the machine has a built-in a priori knowledge of
all the words that are possible in the language it has to deal with, 10 in this case,
and recognition is based not on the measurement of some absolute values but on finding
that one of the possible categories to which the input is most similar. The high
degree of success is, of course, in no small measure due to the fact that only 10
words are possible in the recogniser's language.
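The two-stage process described above, a best spectral match for every moment of the utterance followed by a best match of the resulting duration pattern, can be sketched as follows. This is a toy illustration, not the Bell Telephone Laboratories circuit: squared error is assumed as the matching criterion, and the two reference sounds and two reference words are invented data on a much smaller scale than the 10 by 10 arrangement in the text.

```python
def best_match(vector, references):
    """Index of the reference closest to `vector` (squared error, an assumption)."""
    return min(
        range(len(references)),
        key=lambda i: sum((v - r) ** 2 for v, r in zip(vector, references[i])),
    )

def recognise_word(frames, sound_references, duration_references):
    # Stage 1: for each spectral frame, find the best-matching stored sound
    # and accumulate how long each one (each "relay") was operated.
    durations = [0] * len(sound_references)
    for frame in frames:
        durations[best_match(frame, sound_references)] += 1
    # Stage 2: compare the duration pattern with the stored pattern of each word.
    return best_match(durations, duration_references), durations

# Hypothetical references: two sustained sounds, two words.
sound_references = [[1.0, 0.0], [0.0, 1.0]]
duration_references = [[4, 1], [1, 4]]   # frames spent on each sound, per word

frames = [[0.9, 0.1]] + [[0.2, 0.8]] * 4
word, durations = recognise_word(frames, sound_references, duration_references)

# Reversing the order of the frames gives the same answer, because only the
# duration of each spectral element is used and not the order of occurrence.
word_reversed, _ = recognise_word(frames[::-1], sound_references, duration_references)
print(word, durations, word_reversed)
```

The last two lines illustrate the point made in the text about the time dimension: the decision is unchanged when the order of the spectral elements is altered.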

A very similar development of this system is an attempt to achieve a phonemic
transmission system. Information about the identities of the spectral patterns, recognised
in the manner of the digit recogniser just described, is coded and transmitted.
At the receiving end this information is used to operate a synthesising circuit
which will generate a sound wave with an appropriate spectrum. In this way the
channel capacity of the transmission system need be no greater than what is required
for the transmission of one of 10 symbols following each other at some slow rate, say
15 per second. The system was first tried with only the 10 spoken digits as before
and two listeners recognised almost all the words spoken. When the vocabulary was
increased to include another 37 mono-syllabic words, then the score dropped to about
50%. This system, much extended in several ways and using digital methods, is being
tried again (49) and seems promising.
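The channel capacity implied by "one of 10 symbols ... say 15 per second" can be estimated directly. The short calculation below assumes that the ten symbols are equally likely and independent, which gives an upper bound; the variable names are hypothetical.

```python
import math

symbols = 10            # number of distinct spectral patterns that can be sent
symbols_per_second = 15 # rate suggested in the text

# Upper bound on the information rate for equiprobable, independent symbols.
bits_per_symbol = math.log2(symbols)
bits_per_second = symbols_per_second * bits_per_symbol
print(round(bits_per_second, 1))
```

On these assumptions the requirement is about 50 bits per second, an order of magnitude below the 600 bits per second quoted earlier for the pattern vocoder.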


CHAPTER  III 

THE  THEORETICAL  BASIS  OF  THE  AUTOMATIC  SPEECH  RECOGNISER  TO  BE  CONSTRUCTED 

It  is  clear  from  the  description  of  the  existing  systems  that  a  method  for 
really  successful  automatic  speech  recognition  has  not  yet  been  found.  In  the 
search  for  a  solution  it  has  always  been  realised  that  phonetic  context  and  other 
variables  will  influence  the  acoustic  features  that  characterise  the  phonemes  and 
words.  It  has  always  been  tacitly  assumed,  however,  that  there  are  some  invariant 
acoustic  features  that  characterise  a  phoneme  and  that  are  always  present  when  that 
particular  phoneme  is  spoken  by  the  speaker  or  recognised  by  the  listener.  It  was 
thought  that  these  invariant  features  are  often  hidden  by  the  presence  of  other, 
less  relevant,  acoustic  features  or  can  be  obscured  by  distorting  the  speech  sound 
wave,  but  that  the  listener  can  detect  them  nonetheless  and  thereby  recognise  the 
phoneme  sequence.  It  was  said  that  automatic  speech  recognition  could  be  achieved 
by  detecting  these  invariants,  always  present  although  sometimes  hidden,  if  only 
their  nature  was  uncovered  by  further  research  and  their  characteristics  specified. 
Ikiring  the  past  decades  considerable  effort  has  been  expended  on  finding  these 
invariants.  The  development  of  sound  spectrography  or  "Visible  speech"  (45)  by 
the  Bell  Telephone  Laboratories  provided  a  most  valuable  means  for  the  careful 
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Tef  and  triggered  off  research  at  the  Haskins  Laboratories 
u;  (39;  (40)  (41)  (43)  and  elsewhere.  Although  these  studies  have  advanced  our  know- 
edge  of  the  acoustic  correlates  of  phonemes,  etc.  immeasurably,  they  did  not  produce 
evidence  about  the  invariants:  it  is  the  thesis  of  the  work  described  in  this  report 
that  there  are,  in  fact,  no  such  invariants  and  that  speech  recognition  is  possible 
without  their  existence.  Phoneticians  are  familiar  with  many  examples  in  which,  in 
fact,  the  same  acoustic  wave  is  recognised  as  one  phoneme  or  as  another,  depending  on 
circumstances.  For  instance,  the  words  nan  and  men  are  distinguished  from  each  other 
Dy  the  vowel  phoneme.  It  is  known  that  the  quality  of  this  vowel  varies  considerably 
from  speaker  to  speaker  and  that  it  is  quite  possible  that  for  one  speaker  the  vowel 
quality  for  man  will  be  the  same  as  another  speaker's  men:  nevertheless  when  someone 
listens  to  these  two  speakers  there  will  be  no  difficulty  in  distinguishing  the 
singular  from  the  plural.  A  more  quantitative  demonstration  of  the  same  effect  is 
given  by  the  results  of  an  experiment  in  which  listeners  had  to  recognise  synthetic 
speech  (37).  Different  versions  of  a  carrier  sentence  were  synthesised  in  which  the 
frequencies  of  the  first  two  formants  were  varied;  four  syllables  with  different 
formant  configurations  were  also  synthesised.  In  the  experiment  each  test  item  con¬ 
sisted  of  a  carrier  sentence  followed  by  one  or  the  other  of  the  four  syllables  and 
the  listeners  were  asked  to  identify  the  syllable  as  bit,  bet,  bat  or  but.  The  re¬ 
sults  showed  that  one  and  the  same  syllable  was  recognised  quite  differently  depend¬ 
ing  on  the  range  of  formant  frequencies  used  in  the  carrier  sentence  that  preceded 
the  particular  presentation  of  the  syllable.  For  instance,  in  the  case  of  one  of 
the  syllables,  when  it  was  heard  following  a  particular  carrier  sentence,  it  was  re¬ 
cognised  as  bet  92%  of  the  time  and  the  judgments  changed  to  97%  bit  when  the  first 
formant  frequency  of  the  carrier  sentence  was  raised.  This  experiment,  then,  provides 
further  evidence  to  show  that  depending  on  circumstances  one  and  the  same  acoustic 
event can be recognised as one linguistic unit or another and therefore speech recognition
cannot be determined by these so-called invariant acoustic features. If there
are no invariants, how does the listener recognise speech? This question may be
answered  best  by  re-examining  the  complete  chain  of  events  that  takes  place  when  a 
person  speaks  to  and  is  understood  by  another;  based  on  this  it  might  then  be  possible 
to  define  how  far  a  model  of  the  human  mechanism  for  speech  recognition  could  be  used 
as  an  automatic  speech  recogniser. 


When  a  speaker  wants  to  communicate  with  another  person,  he  first  organises  what¬ 
ever  he  wants  to  say  in  linguistic  terms  or  in  other  words  he  formulates  the  information 
in  the  linguistic  code.  Language  consists  of  a  system  of  units  that  are  combined  into 
larger  units  according  to  rules  peculiar  to  each  language.  For  most  purposes  the 
phoneme  is  considered  the  smallest  convenient  unit,  although  the  phoneme  itself  can 
be  regarded  as  the  combination  of  constituent  elements,  the  distinctive  features  (34). 
The  phonemes  can  be  combined  into  larger  units:  the  morpheme,  the  word,  the  sentence; 
each  of  the  larger  units  can  consist  of  one,  or  a  combination  of  several,  of  the  smaller 
units  and  represents  a  definite  category  of  meaning.  The  number  of  the  units  varies 
from  language  to  language.  In  English  the  number  of  phonemes  is  about  40;  there  are 
probably  several  tens  of  thousands  of  morphemes  and  words  and  the  number  of  different 
sentences  is  very  much  larger  still.  During  speech  this  linguistic  code  is  transformed 
into  a  physiological  one  by  the  generation  of  a  complex  pattern  of  nerve  impulses  at 
various  levels  of  the  central  nervous  system;  this  pattern  eventually  produces  a  set 
of  instructions  that  reach  the  muscles  of  the  vocal  organs  via  the  appropriate  motor 
nerves.  The  activity  of  these  muscles  produces  movement  of  the  vocal  organs,  the 
tongue,  the  lips,  teeth,  soft  palate  and  vocal  cords.  The  movement  of  the  vocal  organs 
generates  the  speech  sound  wave  and  brings  about  a  transformation  of  the  encoded  speech 
information  from  a  physiological  code  into  an  acoustic  code.  The  sound  wave  reaches 
the listener's ears, stimulates his hearing mechanism and thereby generates a pattern
of nerve impulses in the acoustic nerve: this constitutes a re-conversion of the
acoustic  code  into  a  physiological  code.  The  nervous  activity  travels  along  to  the 
central  nervous  system  and  eventually  reaches  the  cortex  of  the  brain  and  influences 
the  nervous  activity  already  there.  The  integration  of  the  nervous  signals  arriving 
from  the  lower  levels  with  the  existing  cortical  activity  somehow  or  other  brings 
about  the  recognition  of  a  sequence  of  linguistic  units  and  eventually  the  under¬ 
standing  of  the  message.  In  the  process  of  transforming  the  linguistic  code  into 
physiological  and  acoustic  codes  and  back  into  a  linguistic  code  the  relationship 
between  events  on  any  two  of  the  levels  is  by  no  means  a  one-to-one  relationship. 
Also,  whilst  events  on  the  linguistic  level  consist  of  a  sequence  of  discrete  units, 
the  acoustic  changes  and  also  many  of  the  physiological  changes  are  continuous  in 
nature.  The  transformation  from  the  continuous  acoustic  signal  to  the  sequence  of 
discrete  linguistic  units  requires  quite  a  number  of  stages,  but  it  is  proposed 
here  that  they  all  belong  to  one  of  two  main  types  of  process:  first,  the  assign¬ 
ment  of  a  sound  to  one  or  other  linguistic  category  based  on  the  acoustic  character¬ 
istics  of  the  sound  wave  input,  and  secondly,  the  modification  of  this  process  of 
primary  recognition  by  the  statistical  and  structural  constraints  of  the  language. 

A  considerable  body  of  information  is  available  about  the  acoustic  cues  on  which 
the  primary  recognitions,  that  is  the  classification  of  all  the  incoming  sound 
patterns into the framework of the 40 or so phonemic categories, are based. This
large body of detailed factual data about cues for speech recognition is not only
interesting in itself but also allows certain general conclusions to be
drawn about the way in which primary recognitions are made. It has been established,
for  example,  that  often  more  than  one  cue  may  serve  to  identify  a  phoneme:  the 
Haskins work has shown that plosive consonants are identified by the spectral position
of the "plosive burst" and also by the nature of the formant "transition" of
the  adjoining  vowel.  Each  of  these  cues  on  its  own  is  sufficient  to  identify  the 
plosive,  but  in  normal  speech  both  cues  are  discernible  simultaneously:  this  type 
of  redundancy  is  one  of  the  reasons  for  the  well-known  fact  that  speech  intelligibi¬ 
lity  is  maintained  even  after  severe  distortion  of  the  speech  sound  wave.  Not  all 
the  cues  for  recognition  are  spectral,  that  is  connected  with  the  formant  structure: 
intensity,  fundamental  frequency  and  duration  may  also  play  their  part.  Often  the 
cues  for  recognition  are  not  the  absolute  values  of  the  acoustic  signals  along 
these  dimensions,  but  their  relative  values.  For  example,  turbulent  energy  at  the 
end of an utterance will usually lead to the recognition of a fricative consonant
in the final position; the duration of the turbulent energy can serve as a cue
for classifying that sound as a voiced or unvoiced phoneme (5). It is not the absolute
duration, however, that matters, but the duration relative to that of the preceding
vowel sound: turbulence of a given duration may be interpreted as a voiced
or  as  an  unvoiced  fricative  depending  on  whether  the  duration  of  the  preceding 
vowel  sound  is  long  or  short.  Similarly,  for  fundamental  frequency  and  intensity 
it  is  the  relative  rather  than  the  absolute  values  that  matter. 


Although such extensive data are now available about the acoustic cues for
phonemic  classification,  these  cues  do  not  provide  anything  in  the  nature  of  an 
invariant  relationship  between  acoustic  characteristics  and  phonemic  class.  The 
acoustic  cues  for  a  particular  phoneme  are  widely  scattered  about  some  mean  and 
the  scatter  is  sufficiently  large  to  produce  considerable  overlap  into  the  acoustic 
areas  of  other  phonemes.  An  additional  complication  is  that  there  is  no  obvious 
way  in  which  the  acoustic  sequence  can  be  segmented  to  correspond  to  the  successive 
phonemes.  Even  when  divisions  are  introduced  at  the  boundaries  of  acoustically 
dissimilar  sections,  it  is  often  found  that  acoustic  characteristics  on  both  sides 
of  the  boundary  have  to  be  considered  to  identify  a  phoneme.  If,  for  example,  the 
spectrum  of  a  syllable  like  /ni:/  is  considered,  a  clear  division  into  two  segments 
will be observed. In the initial segment, produced when the air was flowing through
the nose, the sound intensity is low and there are only a few fairly broad formants;
in the second segment the intensity is much higher, the formants much more
[illegible]. [It would seem natural] to identify the initial consonant
with the first segment and the following vowel with the second segment. In fact both
segments have to be considered to identify the nasal: the first segment shows that the
consonant is a nasal one and the second segment indicates which of the
three English nasal consonants is concerned.

What is it that enables the human listener then to recognise speech with the high efficiency
he is usually capable of, if in fact there is no clear-cut relationship between
the acoustic characteristics and the phonemes? The answer is that when the listener
recognises speech he uses not only the acoustic information derived from the sound
wave but also his knowledge of the subject matter and of the rules of the language
which he acquires over the years as language is learnt. When
recognising English speech, for example, the listener knows that he has to recognise
phonemes in a system containing 40 units in all. He knows that these phonemes do not
follow  each  other  in  any  sequence  but  that  certain  sequences  are  much  more  likely  than 
others.  He  also  knows  that  the  phonemes  combine  into  morphemes  and  the  morphemes  into 
words.  He  knows  all  the  possible  morphemes  and  words  in  the  language,  the  rules  for 
combining  them  into  sentences  and  also  the  ways  in  which  expectations  on  the  higher 
levels will affect those on the lower levels: once a number of words have been recognised
they will determine what set of words is most likely to follow and this information
is fed back and will influence the sequential probabilities on the morphemic and
phonemic levels. The strong effect of the constraints at the higher levels on the expectations
at the lower ones can be demonstrated experimentally. Results of a test
are  available  (22)  in  which  the  extent  to  which  the  subject  could  predict  the  phoneme 
sequence  in  a  sentence  was  measured.  The  subject  did  not  know  in  advance  what  the 
sentence  was  but  had  to  guess  it  phoneme  by  phoneme;  a  record  was  kept  of  the  number 
of  times  he  had  to  guess  before  obtaining  the  correct  answer  for  each  of  the  phonemes 
in  the  sequence.  The  results  show  that  about  half  of  the  first  guesses  were  correct, 
demonstrating  the  strong  effect  of  linguistic  constraints.  The  results  show  further 
that  the  number  of  wrong  guesses  (the  uncertainty)  was  greater  for  phonemes  at  or  near 
the  beginning  of  words  and  that  the  number  of  these  wrong  guesses  at  the  beginning  of 
words  becomes  less  and  less  for  words  towards  the  end  of  the  sentence,  showing  how  the 
effect  of  constraints  at  higher  levels  is  fed  back  to  the  lower  ones. 

Generally  speaking  then,  when  phoneme  recognitions  are  made  certain  expectations 
are  available  which  restrict  the  alternatives  from  which  a  choice  is  to  be  made  when 
the  acoustic  information  is  received.  Sometimes  these  expectations  are  so  strong 
that the final choice can be made without acoustic data altogether. The realisation
that  such  sequential  constraints  do  in  fact  operate  makes  it  possible  to  omit  the 
vowels  in  the  written  form  of  certain  languages:  it  was  realised  that  once  the  con¬ 
sonants  were  written  down  the  reader  could  guess  the  vowels  by  using  his  knowledge 
of the language. Although in English the vowels are written down, most readers
would  probably  find  no  difficulty  in  understanding  the  following  sentence  in  which 
only  the  consonants  are  written  and  all  vowels  are  replaced  by  an  x: 

thx cxt sxt xn thx mxt
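
The effect can be illustrated with a minimal sketch. It assumes a small, purely hypothetical word list (`LEXICON`) and two illustrative helper functions (`mask_vowels`, `candidates`) that are not part of the work described here; the point is only that a consonant skeleton leaves very few admissible words, so that knowledge of the language can supply the missing vowels.

```python
import re

def mask_vowels(text: str, mark: str = "x") -> str:
    """Replace every vowel letter with `mark`, keeping consonants and spaces."""
    return re.sub(r"[aeiou]", mark, text.lower())

def candidates(masked_word: str, lexicon: list[str]) -> list[str]:
    """Return the lexicon words whose vowel-masked form equals `masked_word`."""
    return [word for word in lexicon if mask_vowels(word) == masked_word]

# Hypothetical toy lexicon, chosen for illustration only.
LEXICON = ["the", "cat", "cot", "cut", "sat", "sit", "set",
           "on", "in", "an", "mat", "met"]

masked = mask_vowels("the cat sat on the mat")
# For each masked word, list the words the reader could choose between.
options = [candidates(word, LEXICON) for word in masked.split()]
```

Even with this tiny lexicon each masked word admits at most three readings, and the constraints of the sentence as a whole would normally reduce these to one.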


It  may  be  worth  while  to  give  one  more  example  of  the  effect  of  linguistic 
constraints  on  recognition.  When  conversation  is  carried  on  over  a  noisy  telephone 
line  it  often  happens  that  normal  conversation  is  quite  intelligible,  but  as  soon 
as an unusual word or a proper name is mentioned it has to be spelt out to be understood.
This shows that in quite normal situations the acoustic cues available are
not sufficient and linguistic information is essential for speech recognition. For
the  unusual  words  or  proper  names  the  constraints  fed  back  from  the  sentence  and  word 
levels  are  not  sufficient,  the  acoustic  cues  are  not  unambiguous  enough  and  the  recog¬ 
nition  process  breaks  down.  A  similar  example,  of  course,  is  the  use  in  communication 
systems of the Able, Baker, Charlie spelling alphabet. When the letters are spelt out
as  a,  b,  c,  etc.  the  listener  has  to  rely  largely,  though  not  entirely,  on  acoustic 
cues  for  recognition;  when  the  words  Able,  Baker,  etc.  are  used  instead,  then  the 
English  word  system  provides  considerable  linguistic  constraints  which  make  under¬ 
standing  less  uncertain. 

There  is  a  considerable  body  of  evidence,  then,  showing  that  there  are  no 
acoustic  characteristics  that  have  an  invariant  relationship  with  the  phonemes  and 
which  although  as  yet  unknown  could  be  discovered  by  experiments:  the  human  being 
does  not  rely  solely  on  acoustic  cues  for  speech  recognition  but  utilises  strong 
constraints  arising  out  of  the  linguistic  organisation  of  the  transmitted  information. 
If  even  a  human  being  cannot  recognise  speech  successfully  by  using  acoustic  criteria 
alone,  then  an  automatic  speech  recogniser  is  not  likely  to  be  able  to  do  so  either, 
and  it  seems  worth  while  to  investigate  the  use  of  linguistic  statistics  for  automatic 
recognition. This has been done by constructing an automatic speech recogniser which
utilises  both  acoustic  and  linguistic  information. 

Before  embarking  on  the  detailed  design  of  the  recognition  system,  a  general 
question  has  to  be  settled:  the  kind  of  linguistic  unit  in  terms  of  which  the 
recognitions are to be made. It has already been pointed out that an essential feature
of the human recognition process is that the acoustic input is classified into
a  restricted  number  of  basic  units  and  that  these  are  then  combined  into  larger  units. 
The  automatic  recognition  process  has  to  perform  an  analogous  function  but  in  theory 
the  unit  of  recognition  need  not  necessarily  be  the  phoneme,  it  could  be  one  of  the 
larger  units  and  the  choice  must  be  made  on  the  grounds  of  convenience.  The  larger 
the  unit  the  greater  the  number  of  individual  units  in  the  system:  there  are  only 
about  40  phonemes,  but  many  thousands  of  words  and  an  even  greater  number  of  sentences. 
The  requirements  for  storing  linguistic  information  in  a  machine  dealing  with  phonemes 
are  therefore  considerably  more  modest  than  in  a  word  recogniser.  This  is  even  more 
so  when  the  question  of  storing  the  statistical  information  about  the  probability  of 
sequences  of  these  linguistic  units  is  considered:  the  possible  number  of  sequences 
of four phonemes is around 2½ million whilst the theoretical maximum of just two-word
sequences  is  about  a  hundred  million.  At  first  sight  this  implies  that  the  recogniser 
would  have  to  store  a  much  larger  number  of  items  if  the  word  is  chosen  as  the  unit  of 
recognition  than  if  the  phoneme  is  chosen.  This  need  not  necessarily  be  so  because  it 
may  well  be  that  primary  recognition  of  words  is  more  successful  than  that  of  phonemes 
and  therefore  less  linguistic  information  may  be  needed  to  achieve  the  same  overall 
success  than  with  a  phonemic  system.  Furthermore,  a  large  proportion  of  phoneme  and 
of  word  sequences  never  occur  at  all  and  it  is  not  known  how  many  significantly  differ¬ 
ent  values  of  phoneme  and  word  sequence  probabilities  would  have  to  be  stored  for 
equally  successful  recognition.  In  the  light  of  future  evidence,  it  may  be  found 
therefore  that  a  word-based  recognition  system  is  more  economical  in  storage  require¬ 
ments  than  an  equally  successful  phoneme  recogniser,  but  for  the  purposes  of  the 
present work the phoneme seemed to be by far the most economical unit. In this way
the  number  of  basic  recognition  units  could  be  kept  quite  small,  to  just  a  selection 
of  phonemes,  and  the  recogniser  could  still  deal  with  speech  material  consisting  of 
several  hundred  words  and  at  the  same  time  requiring  memory  capacity  for  only  a  few 
hundred  values  of  phoneme  digram  frequency.  Had  the  word  been  used  as  the  basic 
recognition unit, then if the system was to deal with the same number of words as
before,  the  memory  of  digram  frequencies  would  have  had  to  be  made  much  larger  (al¬ 
though,  of  course,  the  system  might  also  work  more  efficiently). 
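
The storage figures quoted above follow from simple counting, as the short sketch below shows. The vocabulary size of 10,000 words is an assumption made here because it is consistent with the "hundred million" two-word sequences mentioned in the text; the function name `n_sequences` is illustrative only.

```python
def n_sequences(inventory_size: int, length: int) -> int:
    """Number of distinct ordered sequences of `length` units drawn from an inventory."""
    return inventory_size ** length

N_PHONEMES = 40         # approximate size of the English phoneme system (from the text)
N_WORDS = 10_000        # assumed vocabulary size, consistent with the text's estimate
N_REPERTORY = 12        # phoneme repertory of the present recogniser

four_phoneme_sequences = n_sequences(N_PHONEMES, 4)   # about 2.5 million
two_word_sequences = n_sequences(N_WORDS, 2)          # a hundred million
repertory_digrams = n_sequences(N_REPERTORY, 2)       # digram table for the recogniser
```

With a repertory of 12 phonemes the digram table has 144 entries, which agrees with the statement that memory is needed for only a few hundred values of digram frequency.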


CHAPTER  IV 


THE DESIGN AND CONSTRUCTION OF THE AUTOMATIC PHONEME RECOGNISER

In the light of the foregoing discussion it was decided to construct an automatic
phoneme recogniser which used both acoustic and linguistic information in its operation.
The general scheme for such a system is shown in Fig. 5. The speech sound wave is
applied to the acoustic analyser where it is examined in various ways and a preliminary
recognition, based on the relevant acoustic characteristics, is then indicated
at the output. In another part of the recogniser, the "store of linguistic knowledge",
information is available about the probability of occurrence of the various linguistic
units as a function of context. In the case of the present recogniser, this linguistic
information is in the form of digram frequencies, that is the probability of occurrence
[of each phoneme as a function of] the immediately preceding phoneme. The final
recognition is made by the computer which combines the information derived from the
acoustic analyser and from the store of linguistic knowledge. The computer makes its
final decision then, by considering the acoustic cues derived from the speech sound
wave and the sequential probabilities determined by language statistics. The decisions
of the computer are indicated on an electric typewriter. The typewriter was used purely
as a convenient means of recording the output of the computer rather than as part of an
effort to construct a phonetic typewriter, a device which types out the [words] spoken
into it. The object of building the recogniser was to see what improvements, if any,
could be achieved when linguistic statistics were used to modify the results of
acoustic recognition, rather than to pursue the design of the acoustic detector to a
great degree of refinement or to achieve a practical automatic recogniser as such.

Fig. 5. Block diagram illustrating principle of operation of automatic phoneme recogniser.
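
The principle of the scheme can be sketched in a few lines of code. This is an illustration under stated assumptions, not a description of the machine's circuits: it is assumed here that the acoustic analyser delivers one score per phoneme, that each score is weighted by multiplication with the digram probability given the previously selected phoneme, and that the largest product is chosen. The phoneme set, the scores and the probabilities in `DIGRAM` are invented for the example, and `#` stands for the space between words.

```python
def recognise(acoustic_frames, digram_prob, start="#"):
    """Select one phoneme per frame from acoustic scores weighted by digram probabilities.

    acoustic_frames: list of dicts mapping phoneme -> acoustic score.
    digram_prob:     dict mapping (previous, current) -> probability.
    The previous phoneme is taken to be the recogniser's own last decision (an assumption).
    """
    previous = start
    decisions = []
    for scores in acoustic_frames:
        # Weight every acoustic score by the probability of that phoneme in context.
        combined = {ph: s * digram_prob.get((previous, ph), 0.0)
                    for ph, s in scores.items()}
        best = max(combined, key=combined.get)
        decisions.append(best)
        previous = best
    return decisions

# Hypothetical digram probabilities and acoustic scores, for illustration only.
DIGRAM = {("#", "s"): 0.5, ("#", "t"): 0.4, ("#", "i:"): 0.1,
          ("s", "s"): 0.05, ("s", "t"): 0.35, ("s", "i:"): 0.6}
FRAMES = [{"s": 0.6, "t": 0.4, "i:": 0.0},
          {"s": 0.5, "t": 0.1, "i:": 0.4}]

acoustic_only = [max(frame, key=frame.get) for frame in FRAMES]
with_linguistic = recognise(FRAMES, DIGRAM)
```

In the second frame the acoustic evidence alone favours /s/, but since /s/ rarely follows /s/ in the assumed statistics the combined decision is /i:/. This is the kind of correction the recogniser was built to investigate.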

The  machine  to  be  described  here  was  designed  to  recognise  English  words,  spoken 
in  isolation  and  using  Southern  English  pronunciation.  The  method  of  selecting  the 
list of test words will be described later. For the sake of economy of construction
the  phoneme  repertory  of  the  recogniser  was  restricted  to  12  phonemes:  4  vowels,  7 
consonants  and  the  space  between  words  which  was  also  treated  as  a  separate  phoneme 
as  it  had  a  definite  structural  function.  The  vowel  phonemes  were  chosen  to  be  as 
representative of English vowels as possible: /i:/ as in heat, /a:/ as in heart and
/u:/ as in hoot were selected as being the three English vowels whose articulation
was nearest to the three corners of the "vowel triangle", and /ə:/ as in hurt was
selected because its articulation corresponds to a position near the centre of the
triangle. It was well-known that the recognition of the consonants was more
difficult and, while trying to keep the selection as representative as possible, the
more difficult ones were not necessarily included. In particular, while representatives
of every "manner of articulation" used in English were included, on the whole
those phonemes which are articulated with relatively high intensity were used; for
this reason the consonants were as far as possible of the unvoiced variety. The
seven consonants were /t/ and /k/, /s/ and /ʃ/, /m/ and /n/, and /l/. Later the
repertory of the recogniser was extended to include two additional consonants, /z/
and /f/.


The test words were recorded on magnetic tape and in all experiments the play-back
from the tape was used as the input to the recogniser instead of live speech.
Care was taken to use a recording system with as good a signal-to-noise ratio as
possible in order to accommodate the very wide dynamic range of speech sound waves.
An Ampex 600 was used for recording and also for play-back. The recorder was modified
to run at 15 in./sec. and full-track heads were used. In this condition it
gave a signal-to-noise ratio of 60 db. as measured on the screen of a cathode ray
tube.  The  overall  frequency  response,  comparing  the  electrical  input  to  the  recorder 
with  the  electrical  output  from  the  play-back,  was  flat  within  ±1  db.  over  the  range 
60  c.p.s.  to  15,000  c.p.s.  A  Standard  Telephones  and  Cables  type  4021F  moving  coil 
microphone  was  used  which  has  a  comparably  good  frequency  response.  The  playback 
amplifier was followed by pre-emphasis, a certain amount of peak clipping and power
amplification. Pre-emphasis was used to equalise the average speech spectrum which
otherwise would be falling towards the higher frequencies. The pre-emphasis amounts
to about 4 to 5 db. per octave over the frequency range 500 to 4000 c.p.s. The circuit
is shown on the left of Fig. 6. The signal from the tape recorder is connected
to valve V1 which is used as an anode follower with a 0.005 µF capacity across the
input resistance. The output of the pre-emphasis circuit was peak-clipped to provide
a crude form of volume compression; spectrographic analysis showed that the distortion
produced by the peak clipping did not materially modify the speech spectrum. The
clipping amounted to 20 db. relative to the peaks of the speech wave as observed on
the screen of a cathode ray tube. The circuit can be seen in Fig. 6 and is centred
on the diodes D1 and D2. When no signal is applied the live end of the 2.7 K ohm
resistance receives a bias of about +5 volts from the positive H.T. line. Any signal
arriving from V1 is then clipped symmetrically at the ±2.5 volt level. The output
of the clipper is applied to the power amplifier, consisting of valves V3, V4, V5 and V6.
The output available is 8 watts into 300 ohms with under 1% distortion. The output
of this power amplifier provides the input for all circuits concerned with acoustic
recognition.

Fig. 6. Circuit diagram of filter-bank amplifier. On this and all other circuit diagrams the values
of resistance are shown in ohms and of capacitance in microfarads, unless otherwise stated.
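
The action of the peak clipper can be pictured with a small numerical sketch. It is a digital stand-in for the diode circuit, not a model of it: the function names are illustrative, and the 25 volt test wave is an assumed value chosen so that clipping 20 db. below the peaks falls at the ±2.5 volt level quoted above.

```python
import math

def clip_level(peak: float, clip_db: float) -> float:
    """Clipping threshold lying `clip_db` decibels below the peak amplitude."""
    return peak * 10 ** (-clip_db / 20)

def peak_clip(samples, clip_db: float = 20.0):
    """Clip a waveform symmetrically at a level `clip_db` below its own peak."""
    peak = max(abs(x) for x in samples)
    level = clip_level(peak, clip_db)
    return [max(-level, min(level, x)) for x in samples]

# Assumed test signal: one cycle of a sine wave with 25 volt peaks.
wave = [25.0 * math.sin(2 * math.pi * k / 100) for k in range(100)]
clipped = peak_clip(wave, clip_db=20.0)
```

A reduction of 20 db. corresponds to a factor of ten in amplitude, so the loud portions of the wave are limited to a tenth of their peak value while portions already below the threshold pass unchanged; this is what gives the crude volume compression.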

The design of the acoustic recogniser was based on the detection of well-established
acoustic cues only because, as pointed out earlier, the purpose of the
acoustic detector was to carry out a simple recognition process which could then be
modified by linguistic information; the effect of using this linguistic information
could probably be evaluated even if the detection of acoustic cues is not the
best possible in the light of what is known on the subject.

All  acoustic  recognition  processes  were  based,  partly  at  least,  on  spectral 
cues  and  the  necessary  spectral  analysis  was  obtained  by  applying  the  speech  sound 
wave  to  a  bank  of  filters.  The  filters  covered  the  range  from  160  c.p.s.  to  8000 
c.p.s. in 18 adjacent bands, with each filter about a third of an octave wide and
with their mid-band frequencies spaced about a third of an octave apart throughout
the range. Each filter consisted of an inductively coupled, double section, series
tuned circuit as shown in Fig. 7. At resonance the output voltage was about 3.8
times greater than the input. The frequency response curves for three adjacent sections
shown in Fig. 8 are typical of all the filters. It will be seen that the mid-frequencies
of adjacent filters are a third of an octave apart, 6 db. attenuation is
obtained at the cross-over points, about 15 db. attenuation at the mid-frequency of
the adjacent filter and about 35 db. an octave from the mid-frequency. The mid-band
frequencies and corresponding serial numbers of all 18 filters are also shown on this
graph. Such a logarithmic distribution of bandwidths and of mid-frequency spacing was
chosen because this particular filter bank was available at the time the work was
started rather than because it was thought to be the most suitable for speech wave
analysis. It would have been preferable to have filter spacings and bandwidths
that were constant at about 100 to 150 c.p.s. up to about 1 kc.p.s. or 2 kc.p.s.
and a logarithmic increase for the higher frequencies. Such a distribution would
correspond to equal steps on the "Koenig" scale (36), a scale specially designed
for the analysis of speech waves. In the filter bank used, at the lowest frequencies
the bandwidth is so narrow that the time constant is too long to follow fast
variations of energy that may be significant cues for speech perception, while in
the 1 kc.p.s. to 2 kc.p.s. region the bandwidth is too wide to distinguish significant
variations of formant frequency.

Fig. 7. Circuit diagram of analysing filters.

Fig. 8. Frequency response curve of three typical filters (numbers 15, 16 and 17) and serial
numbers with centre frequencies of all 18 filters.
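
The consequences of the third-octave spacing can be checked with a short calculation. The sketch assumes ideal third-octave bands whose edges lie a sixth of an octave on either side of the centre frequency; the actual filters had nominal, rounded centre frequencies and sloping responses, so the figures are approximate. The function names are illustrative.

```python
def third_octave_centres(f_low: float = 160.0, n_bands: int = 18):
    """Centre frequencies (c.p.s.) spaced exactly a third of an octave apart."""
    return [f_low * 2 ** (k / 3) for k in range(n_bands)]

def third_octave_bandwidth(centre: float) -> float:
    """Width (c.p.s.) of an ideal third-octave band around `centre`."""
    return centre * (2 ** (1 / 6) - 2 ** (-1 / 6))

centres = third_octave_centres()
bandwidths = [third_octave_bandwidth(fc) for fc in centres]

for number, (fc, bw) in enumerate(zip(centres, bandwidths), start=1):
    print(f"filter {number:2d}: centre {fc:7.0f} c.p.s., bandwidth {bw:6.0f} c.p.s.")
```

Eighteen such bands starting at 160 c.p.s. end just above 8000 c.p.s., in agreement with the range stated. The lowest band is under 40 c.p.s. wide, whereas the band centred near 1.3 kc.p.s. is nearly 300 c.p.s. wide, compared with the 100 to 150 c.p.s. that would have been preferred in that region.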

Each filter is followed by a rectifier and smoothing filter to obtain a measure
of the energy level in each frequency band. The rectifier circuit is shown in
Fig. 9. The values of condensers C1, C2 and C3 were different for the different filter
channels. Their values were kept as small as possible consistent with reasonable
smoothing (not more than 10% ripple) and a low pass cut-off frequency (3 db.
point) not less than 100 c.p.s. or the bandwidth of the preceding filter channel,
whichever was lower. The 2000 ohm potentiometer shown on the right-hand side of
the diagram provides a permanent bias of +8 volts at the output of all filter
rectifiers; this was found convenient for the multiplier circuits to be described
later. The input-output curve, showing the relationship between the peak
A.C. input and the D.C. output voltage, was linear to 1% over the range used (8-100
volts D.C. output).

Fig. 9. Circuit diagram of filter rectifier and smoothing filter.
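
The function of the rectifier and smoothing filter can be sketched digitally. This is only an analogy to the circuit of Fig. 9: full-wave rectification followed by a single-pole low-pass filter is assumed here, the sampling rate is an arbitrary choice, and the +8 volt bias is omitted. The rule for the cut-off frequency follows the text; the function names are illustrative.

```python
import math

def smoothing_cutoff(filter_bandwidth: float) -> float:
    """Lowest permitted cut-off (c.p.s.): 100 c.p.s. or the channel bandwidth, whichever is lower."""
    return min(100.0, filter_bandwidth)

def rectify_and_smooth(samples, sample_rate: float, cutoff: float):
    """Full-wave rectify a signal and smooth it with a single-pole low-pass filter."""
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff / sample_rate)
    level = 0.0
    envelope = []
    for x in samples:
        level += alpha * (abs(x) - level)   # move towards the rectified value
        envelope.append(level)
    return envelope

# Assumed test signal: one second of a 1000 c.p.s. tone of unit amplitude.
RATE = 48_000
tone = [math.sin(2 * math.pi * 1000 * n / RATE) for n in range(RATE)]
envelope = rectify_and_smooth(tone, RATE, smoothing_cutoff(300.0))
```

For a steady tone the smoothed output settles near 2/π of the peak amplitude, with a small residual ripple. A lower cut-off would reduce the ripple further but would make the output slower to follow rapid changes of energy, which is the compromise described above.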

Turning  now  to  the  actual  automatic  recognition  processes,  the  many  different 
acoustic  cues  that  are  available  for  automatic  recognition  have  been  discussed 
previously  in  publications  (24)  and  in  this  report  on  the  whole  only  those  cues 
will  be  mentioned  that  have  been  utilised  in  the  recognition  processes  described 
here. 


THE AUTOMATIC RECOGNITION OF THE VOWELS

All  vowels  and  some  of  the  consonants  were  recognised  by  detecting  characteristic 
spectral  envelopes.  It  has  long  been  acknowledged  that  the  most  characteristic  feature 
of  vowel  sounds  is  the  frequency  of  the  formants  and  the  frequencies  of  the  maxima  of 
the  spectral  envelope  are  more  often  than  not  a  close  approximation  to  these  formant 
frequencies. The relevant spectral peaks were determined in preliminary experiments.
This was necessary because no published information is available on the average formant
frequencies  of  British  English  (although  data  are  available  for  American  English  (44)). 
It  would  have  been  desirable  to  re-examine  the  structure  of  spectral  peaks  for  the 
speech  material  to  be  used  in  the  automatic  recognition  experiment  even  if  formant  data 
in  the  form  of  average  values  for  English  had  been  available.  This  is  partly  because 
average  values  can  differ  greatly  from  individual  results  for  particular  phonemes  and 
partly because the selectivity of the filters used for analysis will influence the extent
to which individual formants can be isolated. The recorded speech material was
therefore  applied  to  the  bank  of  filters  and  the  rectified  and  smoothed  output  voltage 
of  each  filter  was  recorded  using  a  multi-channel  pen  recorder.  Typical  records  are 
shown in Figs. 10(a) and 10(b). The output of only 6 different filters and for only a
few test words is shown in these examples: in the preliminary experiments records of
all 18 filter outputs [illegible] were made in this fashion.

Fig. 10. Pen recording of rectified filter outputs for a few test words.

The change in the spectral distribution of energy as the vowel changes from /i:/ to
[symbol illegible] to /a:/ can be seen clearly in Fig. 10(a), and for /i:/ and /a:/ in
Fig. 10(b). On examining records of this kind it was found that the four vowels
represented in the [illegible] of input could be adequately differentiated by specifying
pairs of filter channels in which energy peaks occurred simultaneously. These peaks
were centred at [figure illegible] c.p.s. and 800 c.p.s. for /u:/, at 250 c.p.s.
and [figure illegible] c.p.s. for /i:/, at 640 c.p.s. and 1,600 c.p.s. for /ə:/, and
at 640 c.p.s. and [figure illegible] c.p.s. for /a:/. The values of frequency indicate
that it is mostly the
first  and  second  formants  that  determine  these  peaks,  although  occasionally,  as  in 
the case of /i:/, it is the third formant or the merging of second and third formants
in  one  filter  band  that  are  responsible  for  the  spectral  peak.  Automatic  recognition 
of  the  four  vowels  was  obtained  by  comparing  the  spectrum  of  the  speech  input  with 
the  four  spectral  patterns  specified  in  terms  of  the  above  pairs  of  spectral  peaks 
and  indicating  with  which  of  these  four  the  input  corresponds  best.  The  method  used 
to  achieve  such  a  comparison  was  to  multiply  the  rectified  output  voltages  of  the 
appropriate  pairs  of  filters;  in  this  way  as  many  multiplication  products  become 
available as there are phonemes to be recognised - four as described so far. The
products are then compared in magnitude and the largest selected. A schematic diagram
of such a system is shown in Fig. 11.

Fig. 11. Schematic diagram of spectral pattern matching device.

This system of spectrum-matching has several advantages. The incoming spectral
patterns are necessarily assigned to one or other of the categories of spectral
configuration which make up the predetermined system within which the automatic
recogniser operates. Another advantage is that no threshold judgment is involved: the
recogniser compares the spectrum of the input with each of the reference patterns and
selects the best match, rather than basing its judgment on whether the combined energy
from a given pair of filters exceeds a certain level or not. This method also provides
quite a reasonable solution of the "segmenting" or "gating" problem: whenever the
spectrum of the input changes sufficiently for the best match to shift from one
reference pattern to another a "phoneme boundary" is indicated. Yet another advantage
is that not only is the best match indicated at the output, but information is also
given about how closely the input resembles the other reference patterns. This
information is available through the output voltages from the other multipliers and is
going to be of great advantage later when the results of acoustic recognition and
linguistic information are combined to give a final recognition. Fig. 11 shows that
the principal circuit elements of the pattern matching process are the multipliers and
the maximum detector and these will now be described in more detail.
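
The pattern matching scheme described above can be sketched in a few lines of Python.
This is only an illustration under stated assumptions: the filter pairs are the centre
frequencies quoted in the text, the fourth vowel is written here as "ə:", and the
dictionary layout and function names are hypothetical, not part of the original design.

```python
# Reference patterns: pairs of filter centre frequencies (c.p.s.) quoted in the text.
# The dictionary layout and function names are illustrative assumptions.
REFERENCE_PAIRS = {
    "u:": (400, 800),
    "i:": (250, 3200),
    "ə:": (640, 1600),
    "a:": (640, 1250),
}


def match_products(spectrum):
    """Multiply the two rectified filter outputs named by each reference pattern.

    `spectrum` maps a filter centre frequency to its rectified output voltage;
    filters missing from the mapping are treated as giving zero output.
    """
    return {
        phoneme: spectrum.get(low, 0.0) * spectrum.get(high, 0.0)
        for phoneme, (low, high) in REFERENCE_PAIRS.items()
    }


def best_match(spectrum):
    """Select the reference pattern with the largest product (no threshold is used)."""
    products = match_products(spectrum)
    return max(products, key=products.get)


def phoneme_boundaries(frames):
    """Report (frame index, phoneme) wherever the best match shifts to a new pattern."""
    boundaries = []
    previous = None
    for index, spectrum in enumerate(frames):
        current = best_match(spectrum)
        if current != previous:
            boundaries.append((index, current))
            previous = current
    return boundaries
```

The products that are not selected remain available from `match_products`, which
mirrors the remark that the other multiplier outputs show how closely the input
resembles the remaining reference patterns.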

The multiplier circuit tried out in the first instance used a Mazda 6F33 type
of valve. This is a pentode with a suppressor grid which has the same order of
sensitivity as the control grid. The two voltages were applied to the control grid
and suppressor grid respectively and the output was obtained from the anode. This
type of circuit however was soon abandoned, partly because of its instability and
partly because of its non-linearity. The next circuit that was tried, the one
which was eventually put into use in the recogniser, carried out the multiplication
process by generating a square wave in which the mark-space ratio was controlled by
one of the voltages to be multiplied and the amplitude of the square wave was
determined by the other voltage to be multiplied. The area underneath such a square wave
is proportional to the product of the height and length of the square wave and hence
to the product of the two input voltages. An output proportional to this area is
obtained by integrating the square wave. The square wave necessary for the
multiplication process is derived from a triangular wave, generated by the circuit of Fig. 12.


The output of a free-running multivibrator circuit, incorporating valves V1 and
V2, is amplified and squared by V3, V4 and D2. The square wave output is
applied to the Miller integrator of V5 and thus provides a triangular wave which
is connected to the final output through the cathode follower V6. This triangular
voltage, the wave shape of which is shown in Fig. 13(a), is used with all the
multiplier circuits. The principle of obtaining the variable square wave, necessary
for the multiplication process, from the triangular voltage is shown in Fig. 14.
A thin slice of the triangular voltage is cut out and the mark-space ratio of the
resulting square wave is determined by the slicing level which in turn is set by
one of the voltages to be multiplied. This square wave is amplified and is then
amplitude-limited at a level determined by the other voltage to be multiplied.
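
The multiplication principle just described can be checked numerically. The sketch
below is an idealised model, not the valve circuit itself: it assumes a symmetrical
triangular wave running from 0 to a peak of 20 volts, a perfectly sharp slicer, and
averaging over exactly one cycle. The function names and the sample count are
hypothetical.

```python
def triangle(phase, peak):
    """Symmetrical triangular wave between 0 and `peak`; `phase` lies in [0, 1)."""
    return peak * (2 * phase if phase < 0.5 else 2 * (1 - phase))


def pulse_multiply(a, b, peak=20.0, samples=10_000):
    """Average of a square wave whose mark fraction is set by a and height by b.

    The slicing level is set by the first voltage, so the wave is "on" for a
    fraction a / peak of each cycle; the second voltage limits its amplitude.
    Averaging (integration) therefore gives approximately a * b / peak.
    """
    total = 0.0
    for n in range(samples):
        phase = n / samples
        mark = triangle(phase, peak) < a      # slicing level set by first voltage
        total += b if mark else 0.0           # amplitude limited by second voltage
    return total / samples                    # integration over one cycle
```

With these assumptions `pulse_multiply(5, 8)` comes out close to 5 x 8 / 20 = 2, and
doubling either input doubles the result, which is the proportionality the text relies on.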


Fig. 13. Oscillograms demonstrating the operation of the multiplier.
(a) Waveshape of triangular voltage generator.
(b) Composite oscillogram showing triangular voltage and a square slice from it.
(c) and (d) Examples of mark-space ratios obtained when slicing at different levels.
(e) A square wave with the same mark-space ratio as in (c) but limited to a different
amplitude.


Fig.  14.  Principle  of  operation  of  multiplier 


The corresponding circuit diagram is shown in Fig. 15. One of the voltages to be
multiplied is applied to the grid of the buffer cathode follower valve V1. The
output from V1 is mixed at the grid of V2 with the triangular voltage. The combined
voltage is in the form of a triangular voltage whose absolute level is determined by
the input from V1. Valves V3 and V4 perform a squaring and amplifying function,
thereby cutting a square slice from the triangular voltage at a height determined by the
voltage derived from V1. The amplified square wave is then amplitude-limited at a
level determined by the other voltage to be multiplied which is connected to the grid
of V5: the amplitude of the square wave from V4 cannot rise above that of the cathode
of V5 because of the clamping diode D1. The diode D2 D.C. restores the output of V4
to the zero voltage level. The square wave is then integrated by the 330 K ohm
resistance and the 0.01 uF condenser and the integrated voltage from the condenser provides
the output voltage proportional to the product of the two voltages applied to the grids
of V1 and V5 respectively. Typical wave shapes illustrating the multiplying process
are shown in Fig. 13. The frequency of the triangular voltage had to be high compared
with the rate of change of the voltages to be multiplied. These input voltages are
derived from the speech spectrum and no significant change can reasonably be expected
in less than 1/20th to 1/30th of a second. For this reason the frequency of the
triangular voltage was set around 3,300 c.p.s. so that at least 100 cycles of this
voltage take place during the shortest significant period of the voltages to be
multiplied. The time constant of the integrating circuit is 3.3 msec., that is about ten
times the period of the triangular voltage, and the time constant of the multiplier
as a whole, as measured on the screen of a cathode ray tube, is about 6 msec. This time
constant, which is long relative to the triangular wave period and short relative to
the speed of variation of the multiplying voltages, ensures that the output has no
noticeable 3,300 c.p.s. ripple but still follows the changes of the input voltages.

Fig. 15. Circuit diagram of a typical multiplier section.
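
The timing figures quoted for the multiplier can be verified with a short calculation.
The component values and frequencies are those given in the text; the variable names
are illustrative only.

```python
# Values quoted in the text; variable names are illustrative assumptions.
R_OHMS = 330e3               # integrating resistance
C_FARADS = 0.01e-6           # integrating condenser
CARRIER_HZ = 3300.0          # frequency of the triangular voltage
SHORTEST_EVENT_S = 1 / 30    # shortest significant change in the speech spectrum

time_constant_s = R_OHMS * C_FARADS                 # RC of the integrator
carrier_period_s = 1 / CARRIER_HZ                   # one cycle of the triangular wave
cycles_per_event = SHORTEST_EVENT_S * CARRIER_HZ    # carrier cycles per speech event
ratio = time_constant_s / carrier_period_s          # integrator RC in carrier periods

print(f"integrator time constant: {time_constant_s * 1e3:.1f} msec")
print(f"carrier cycles in 1/30 sec: {cycles_per_event:.0f}")
print(f"time constant / carrier period: {ratio:.1f}")
```

The product RC comes to 3.3 msec., roughly eleven carrier periods, and about 110
carrier cycles fit into 1/30th of a second, in line with the "about ten times" and
"at least 100 cycles" stated above.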

The  linearity  of  the  multiplying  action  was  checked  by  connecting  both  inputs  to  the 
same  voltage  and  obtaining  a  curve  relating  this  common  input  with  the  square  root  of 
the  corresponding  output.  The  resulting  curve  was  a  straight  line  over  the  whole  of 
the  operating  range  of  20  volts.  The  stability  of  the  circuit  was  good.  The  H.T. 
was  supplied  from  a  stabilised  power  pack  which  is  described  at  the  end  of  the  section 
dealing  with  the  circuitry  of  the  automatic  recogniser.  This  stabilised  power  pack 
not  only  kept  the  H.T.  voltage  constant  to  better  than  1%  but  also  its  output  impedance 
of  under  0.1  ohm  over  the  whole  of  the  relevant  frequency  range  ensured  freedom  from 
interference  from  other  circuits.  The  heaters  were  not  stabilised  but  did  not  affect 
the  output:  any  effect  due  to  heater  variation  would  influence  all  the  multiplier 
circuits equally and would therefore not affect the final result.

The second part of the spectral pattern matching arrangement is the maximum detector
and both a simplified and a more detailed circuit diagram are shown in Fig. 16. The
voltages to be compared,
derived  from  the  multipliers  and  all  positive  going,  are  applied  each  to  the  grid  of  a 
different  triode  as  shown  in  Fig.  16(a).  The  triodes  have  a  relay  in  their  anode 
circuits  and  they  have  a  common  cathode  resistance.  This  resistance  is  given  such  a 
value  that  the  current  flowing  through  it  can  operate  one  only  of  the  relays  in  the 
anode  circuits.  Whichever  input  is  the  most  positive  will  divert  the  largest  share  of 
the  current  from  the  common  cathode  resistance  to  its  own  triode.  At  the  same  time, 
the  cathode  of  this  valve  will,  due  to  cathode  follower  action,  go  more  positive  and, 
as  all  cathodes  are  connected  together,  will  cut  off  all  the  other  valves.  In  this 
way  the  valve  with  the  most  positive  voltage  on  its  grid  will  capture  most  of  the 
current  available  from  the  cathode  resistance  and  the  relay  in  its  anode  circuit  will 
operate  while  all  other  relays  release.  The  minimum  amount  by  which  a  voltage  to  be 
compared  has  to  be  more  positive  than  all  the  others  is  the  difference  in  grid  voltage 
corresponding  to  the  operate  and  release  currents  of  the  relays.  This  is  between  one 
and  two  volts  for  the  circuit  which  was  used  and  was  not  considered  sufficiently  small 
compared  with  the  20  volt  range  of  output  voltage  from  the  multipliers.  Each  input  to 
the  maximum  detector  was  therefore  amplified  about  seven  times  before  it  was  applied 
to  the  appropriate  grid  of  the  comparison  circuit.  The  detailed  circuit  is  shown  in 
Fig. 16(b).

Fig. 16(a). Maximum detector circuit. Simplified diagram.

Fig. 16(b). Maximum detector circuit. Detailed circuit diagram of one of 16
identical sections.

Valves V1, V2 and V3 form one cell of the maximum detector circuit and are
repeated 16 times to allow 16 different voltages to be compared. V1 is a cathode
follower input stage which is cathode coupled to amplifier valve V2. The gain of this
stage is stabilised by strong feed-back to the grid. Further stabilisation is obtained
by the cathode coupling between V1 and V2. The overall gain of V1 and V2 is about
seven. The amplifier is directly coupled to the grid of the comparison valve V3. The
cathode of V3 is common with the cathodes of the other 15 V3 valves and is connected
to the negative end of the H.T. supply through the common cathode resistance R. The
anode of each V3 is connected to the H.T. supply through its relay. The discrimination
sensitivity of the circuit has been increased to about 0.25 volt referred to the grid
of V1 because of the gain of seven provided by V1 and V2. The circuit is extremely
stable. The H.T. supply is stabilised and variations of heater voltage are neutralised
by feed-back and also by the fact that they are likely to affect all 16 circuits to
the same extent and will therefore not influence the maximum selection. Whenever no
speech input is applied to the recogniser all the input voltages to the maximum
detector become zero: no maximum is evident and therefore random operation might result.
This is avoided by adding a 17th valve to the chain of V3 valves. The grid of this
valve is connected to a fixed D.C. voltage of 54 volts whilst the voltages at the grids
of the other V3 valves vary from 48 volts when no input is applied to the maximum
detector to about 190 volts for maximum input. This means that when all input voltages
drop below about the last 4% of their total range, then the current is captured by the
17th valve and random operation is avoided. The result of the maximum selection is
recorded by arranging that the closing of the contacts of a relay will operate one of
the keys of an electric typewriter. The mechanical and electrical arrangement for the
operation of the typewriter will be described after all the actual recognition
circuits have been discussed.
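
The behaviour of the maximum detector with its additional guard valve can be modelled
very simply. The sketch below assumes an ideal winner-take-all selection and ignores
the 0.25 volt discrimination limit; the three voltages are those quoted in the text,
and the function and constant names are hypothetical.

```python
GRID_MIN_V = 48.0    # grid voltage of a comparison valve with no input
GRID_MAX_V = 190.0   # grid voltage of a comparison valve at maximum input
GUARD_V = 54.0       # fixed voltage on the grid of the 17th (guard) valve


def select(grid_voltages):
    """Return the index of the most positive grid, or None if the guard valve wins."""
    best = max(range(len(grid_voltages)), key=lambda i: grid_voltages[i])
    return best if grid_voltages[best] > GUARD_V else None


# Fraction of the total input range that lies below the guard voltage.
guard_fraction = (GUARD_V - GRID_MIN_V) / (GRID_MAX_V - GRID_MIN_V)
```

Here `guard_fraction` is 6 / 142, a little over 4%, which agrees with the statement
that the guard valve captures the current once every input has fallen into about the
last 4% of its range.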

THE AUTOMATIC RECOGNITION OF THE CONSONANTS /m, n, l, s, ʃ/.

All four vowels in the repertory of the automatic recogniser were detected by using the
pattern matching process - with quite a high rate of success as will be seen later. When
examining the spectrographic records, by using the multi-channel pen recorder as before, it
was found that the same method could be used for the recognition of a number of continuant
consonants, in particular /m/, /n/, /l/, /s/ and /ʃ/. As would be expected, reference
patterns with peaks in the low frequency region were needed to recognise the nasal and
lateral consonants and in the high frequency region for the fricatives. The filters centred
on 200 c.p.s. and 320 c.p.s. were used for /m/, the 250 c.p.s. and 400 c.p.s. for /n/ and the
400 c.p.s. and 500 c.p.s. filters for /l/. These pattern allocations were rather tentative and,
as will be seen later, these three sounds were never well distinguished from each other; it was
easier to distinguish initial /m/ and /n/ sounds from final /m/ and /n/ sounds than it was to
distinguish /m/ from /n/. Also quite different acoustic structure was found to apply to initial
and final /l/. The initial /l/ was almost indistinguishable from /i:/ whilst the final /l/
produced a single broad peak in the 400 to 500 c.p.s. region. The /ʃ/ produced a broad peak in the
2 to 4 Kc.p.s. region whilst the /s/ energy was higher than this, around 6 to 8 Kc.p.s.

The intensity of the fricative energy was high for both fricatives so that there was no
difficulty in detecting the presence of the sound, although the usual double input to the
multipliers was very necessary to establish that the energy was really concentrated in the
appropriate spectral regions. If the presence of energy in only one channel had been taken as the
criterion for recognition, then it would have been easy to confuse /ʃ/ with some of the vowels,
such as /i:/, which have high frequency formants overlapping into the range of /ʃ/. Similar
difficulties would have arisen in distinguishing /s/ from /ʃ/. Some of these points can be
observed in the pen recording shown in Fig. 10. For instance the shift of fricative energy
for /s/ and /ʃ/ can be seen when comparing /si:t/ and /ʃi:t/. The overlapping of spectral cues
for /ʃ/ and /i:/ can be seen on the tracing for /ʃi:t/.

In this way the first version of the automatic recogniser which was extensively tested
had nine multiplier circuits for carrying out spectral pattern matching. Four of these were
used for the vowels /a:/, /i:/, /u:/ and /ə:/ and five for the consonants /m/, /n/, /l/, /s/
and /ʃ/. The table below gives the serial numbers and centre frequencies of the filters
connected to each of these multipliers.


Phoneme    Serial numbers of filter    Centre frequencies of filter
           outputs utilised            outputs utilised (c.p.s.)

a:         13 and 16                   640 and 1,250
u:         11 and 14                   400 and 800
i:          9 and 20                   250 and 3,200
ə:         13 and 17                   640 and 1,600
m           8 and 10                   200 and 320
n           9 and 11                   250 and 400
l          11 and 12                   400 and 500
s          23 and 24                   6,400 and 8,000
ʃ          19 and 21                   2,500 and 4,000

THE AUTOMATIC RECOGNITION OF THE CONSONANTS /t, k/.

Some  of  the  phonemes  selected  for  the  repertory  of  the  recogniser  could  not  be  detected 
on  spectral  cues  alone.  The  acoustic  characteristics  of  the  phoneme  /t/  for  instance  were 
the  silence  during  the  stop  segment  of  the  articulation  and  a  short  burst  of  hiss  energy 
generated  during  the  release  of  the  stop.  The  silence  is  difficult  to  detect  unless  the 
/t/ is produced between two other sounds and is obviously quite unsuitable as a cue for
recognising an initial /t/. The short fricative burst has a spectrum that extends over
roughly the same range as the spectrum of /s/ and can be seen clearly in the pen recordings
of Fig. 10. Although the burst has the same spectrum as /s/, its duration is always
recognisably less than for /s/ and the detection of this difference in duration was used for
recognising /t/. A similar procedure was used for recognising /k/ except that the spectrum
of the burst was at a lower frequency, near that of /ʃ/ or even lower.

The  circuits  used  for  recognising  /t/  and  /k/  were  therefore  practically  identical  - 
they  both  used  a  form  of  duration  measurement  -  and  they  differed  only  in  obtaining  their 
inputs  from  a  different  filter.  The  basic  principle  of  the  plosive  detectors  is  that  an 
electronic  clock,  set  to  measure  off  a  fixed  time  interval,  is  started  as  soon  as  a  voltage 
appears  at  the  input  of  the  detector  circuit.  If  the  input  voltage  disappears  again  during 
the  span  of  time  measured  off  by  the  clock  then  a  plosive  consonant  is  indicated;  if  the 
input  is  still  present  at  the  end  of  the  measured  period  then  the  plosive  detector  becomes 
inoperative again without giving an output and the recogniser chooses its output from
alternatives suggested by one of the multiplying circuits. A simplified diagram of the /t/
detector circuit is shown in Fig. 17. The output of the 8,000 c.p.s. filter is applied to
the grid of the triode V1 and will operate relay A as soon as the input from the filter
exceeds a certain threshold value. This threshold is determined by the bias on the cathode
of V1, obtained from the 15 K ohm and 2 K ohm resistors, and by the setting of the input
potentiometer. The 1M ohm resistor and 0.0015 uF capacitor prevent spurious operation of
the relay by voltage spikes of short duration that sometimes appear at the output of the
filter but are not connected with the phoneme /t/. The operate and release times of relay
A and of all the other relays used in this circuit are less than 2 msec., fairly short
compared with the duration of other significant events in the operation of the circuits.
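
The basic principle of the plosive detectors, a clock started by the appearance of the
input and a decision that depends on whether the input vanishes before the clock runs
out, can be sketched as follows. The sampled-envelope representation, the function
name and the parameter names are assumptions made for illustration; the original
circuit works with relays and a one-shot multivibrator, not with samples.

```python
def detect_plosive(envelope, threshold, window_steps):
    """Return True if the first burst above `threshold` ends within the clock interval.

    `envelope` is a sequence of filter output voltages taken at equal time steps and
    `window_steps` is the clock interval expressed in the same steps.
    """
    start = None
    for index, level in enumerate(envelope):
        if start is None:
            if level > threshold:
                start = index                  # input appears: the clock starts
        else:
            if level <= threshold:
                return True                    # input vanished before the clock ran out
            if index - start >= window_steps:
                return False                   # input still present at the end of the interval
    return False                               # no burst, or record ended undecided
```

A burst lasting no longer than `window_steps` samples is reported as a plosive; a
longer one returns False, which corresponds to leaving the choice to the multiplier
circuits.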

The triodes V2 and V3 form a one-shot multivibrator circuit in which V2 is normally cut
off and V3 is normally on. This means that relay B is normally operated; therefore its
contacts are normally in position 3 and relay C is not energised. An output of large
enough amplitude appearing at the output of the 8,000 c.p.s. filter will operate relay A.
This will trigger the multivibrator and relay B de-energises for a time span determined by
the time constant of the multivibrator. The relay B contacts change into position 1 but
relay C will still not [...]. A decision is not necessarily made until the end of the
multivibrator cycle and therefore the rest of the recogniser has to be prevented from
making a decision until the end of this period. This is achieved by connecting, for the
duration of the multivibrator cycle, a voltage higher than the maximum that any multiplier
can provide to the input of a special valve in the maximum detector. The effect of this is
that this special valve will capture all the current in the maximum detector and therefore
the other relays cannot operate and no output can be provided by the maximum detector.
The inhibiting voltage is derived from the H.T. line of the plosive detector and, once the
multivibrator has been triggered, is provided through the 1M ohm resistance. If the sound
is a short one then a /t/ is indicated when relay A releases, energising relay C, and at
the same time removing the inhibiting voltage. If the sound is a long one, relay C is not
energised but at the end of the multivibrator cycle relay B operates again and its contacts
will short out the inhibiting voltage, thereby allowing the maximum detector to make an
alternative choice. The purpose of [...] is to discharge the 0.0015 uF condenser rapidly
after relay A releases and the diode D1 ensures that the negative voltage which would
otherwise appear at the grid of V2 during the discharge of the 0.0015 uF condenser is
shorted out and does not interfere with the cycle of the multivibrator.

Fig. 17. Simplified circuit diagram of the /t/ detector. (Label on diagram: inhibiting
voltage to special channel of maximum detector.)

By examining recordings similar to those shown in Fig. 10, it was found that the
duration of the [...] in the 8,000 c.p.s. filter for /t/ was never longer than
[...] msec. and for /s/ never shorter than about 200 msec. Other sounds, such as /ʃ/ for
example, also produced energy in the 8,000 c.p.s. filter, but although the duration in these
cases was sometimes quite short, the amplitude was considerably smaller than that for the
/t/ [...]. It was decided therefore to set the threshold of the detector fairly high; this
avoided spurious operation by unwanted signals and [...] that the /t/ spikes [...] in
duration [...] the duration of the multivibrator cycle.


The circuit for the /k/ detector was identical with that of the /t/ detector but its
input was derived from the filter centred on 1,600 c.p.s. As will be seen later when the
results are presented, the operation of the /t/ detector was fairly successful. The /k/
detector was very markedly less successful, although quite distinct voltage spikes could be
observed for many of the /k/-s. Unfortunately quite a number of other sounds, but particularly
[...], [...] longer in duration than the /k/ spikes, [...] could not be eliminated by
smoothing without obliterating [...].


THE  AUTOMATIC  RECOGNITION  OF  SPACE  BETWEEN  WORDS 


The space between words also had to be treated as a phoneme, as it influenced the
statistical distribution of the other phonemes. It was recognised by a simple circuit which
derived its input from the main power amplifier. The circuit, shown in Fig. 18, consists of
a single triode with a relay in its anode circuit. The valve is normally biased so that
the relay is energised and its contacts connect a voltage to one of the inputs of the
maximum detector, thus indicating the presence of silence. As soon as a speech voltage
appears at the input of the detector, it will be rectified and the rectified voltage,
applied to the grid of the triode, will bias it to cut-off and the relay de-energises.
The time constant of the rectifier circuit is short for charging and long for decay. This
is necessary so that the relay will give a quick indication of the beginning of speech
but at the same time does not operate during the short gaps that occur in speech, for
example during stop consonants. The circuit will in fact give an indication well under
[...] msec. after the beginning of speech but will have a delay of about one second at the
end of speech. This is a longer delay than is really required as a break in speech energy of
more than 0.25 sec. during words was never observed.

Fig. 18. Simplified circuit diagram of the space detector.

Referring to Fig. 18, it will be observed that the decay time of the circuit will
depend on the setting of the input potentiometer because this will determine how far
[...]. The positive bias applied to the grid of the triode through the 250 K ohm and
4.7 K ohm resistances will raise the threshold of operation of the circuit so as to avoid
spurious operation by noise. A pen recording demonstrating the operation of the space
detector is shown in Fig. 19. The upper trace shows the speech input rectified and smoothed
with a low pass filter of 10 msec. time constant. The middle trace shows the output of the
space detector, the upper position of the pen corresponding to the space indication. The
third trace at the bottom was produced by a 20 c.p.s. time marker, and indicates a paper
speed of about 1M in./sec. The recording shows clearly the quick response to a speech
input and the slow decay and also the insensitivity to spurious noise.

Fig. 19. Pen recording illustrating the action of the space detector. Traces, top to
bottom: speech envelope; space detector output; 20 c/s time marker.
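
The fast-charge, slow-decay behaviour of the space detector can be imitated with a
simple envelope follower. This is a sketch under assumed values: the attack and decay
time constants, the threshold and the function name are all hypothetical and were
chosen only so that the response is quick at the onset of speech and takes of the
order of a second to indicate a space afterwards.

```python
import math


def space_detector(rectified, dt, attack_s=0.005, decay_s=0.5, threshold=1.0):
    """Return one boolean per sample: True where a space (silence) is indicated.

    `rectified` is the rectified speech voltage sampled every `dt` seconds. The
    smoothed level rises with the short time constant and falls with the long one.
    """
    level = 0.0
    flags = []
    for x in rectified:
        tau = attack_s if x > level else decay_s       # short for charging, long for decay
        level += (x - level) * (1 - math.exp(-dt / tau))
        flags.append(level < threshold)                # below threshold: space indicated
    return flags
```

With these assumed constants a 100 msec. gap inside a word does not produce a space
indication, whereas a silence of a second or more does.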


This completes the description of the acoustic recognition circuits used in the
version of the automatic recogniser with which many of the experiments were carried out.
Later on, however, an attempt was made to add two more sounds, /f/ and /z/, to the
vocabulary of the recogniser and suitable circuits for acoustic recognition had to be found
for them. Both of these used not only spectral but also other kinds of cues for
recognition.


THE AUTOMATIC RECOGNITION OF /f/ AND /z/.

Published work on the analysis and the recognition of fricatives (30) (33) (53)
suggested that intensity as well as spectral pattern might help in the recognition of
/f/. The examination of spectral patterns for /f/ sounds by means of the 18 channel filter
bank showed that the spectral maximum for /f/ is in about the same region as for /s/ but with
a markedly reduced amplitude for /f/ compared with /s/. The usual multiplier circuit was
therefore used for recognising /f/: the same spectral pattern was to be detected as for /s/
and therefore the two inputs to the /f/ multiplier circuit were derived from the 6,400 c.p.s.
and 8,000 c.p.s. filters just as for the /s/ detector. An amplitude dependent selection
between /f/ and /s/ is achieved by not connecting the output of the 6,400 c.p.s. filter to
the inputs of the two multipliers directly but by switching it instead to the input of the
/f/ multiplier if the output of the filter is below a certain threshold value and to the /s/
multiplier if it is above the threshold voltage; the output of the 8,000 c.p.s. filter is
connected permanently to the inputs of both multipliers. In this way only one of these two
multipliers can give an output at any one time, the one to which the 6,400 c.p.s. filter is
connected. The amplitude dependent switching of the filter output is achieved by the circuit
shown in Fig. 20.


Fig.  20.  Diagram  of  amplitude  discriminating  circuit  for  /f/  detector. 

The filter output is connected to a cathode follower input stage and from there to a Schmitt
trigger circuit formed by triodes V2 and V3. The coil of a relay is connected to the anode
circuit of V3 and in the quiescent state the relay is energised and its contacts connect the
output of the filter to the input of the /f/ multiplier. When the filter output exceeds a
certain value the Schmitt trigger operates, the relay de-energises and the output of the
filter is switched to the /s/ multiplier. The total range of voltages from the 6,400 c.p.s.
filter is about 65 volts, the change-over point of the trigger is set to about 50 volts and
the sensitivity of the trigger is about 1 volt.
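
The amplitude dependent switching can be modelled as a two-state trigger. In the sketch
below the 50 volt change-over point is taken from the text, while reading the "1 volt
sensitivity" as the width of the hysteresis band is an assumption made here for
illustration; the class and method names are hypothetical.

```python
class AmplitudeSwitch:
    """Routes the 6,400 c.p.s. filter output to the /f/ or the /s/ multiplier."""

    def __init__(self, upper=50.0, hysteresis=1.0):
        self.upper = upper                   # change-over point quoted in the text
        self.lower = upper - hysteresis      # assumed release level (hysteresis band)
        self.high = False                    # quiescent state: output goes to /f/

    def route(self, volts):
        """Update the trigger state and return "s" or "f" for the selected multiplier."""
        if self.high:
            if volts < self.lower:
                self.high = False            # trigger resets: back to the /f/ multiplier
        elif volts > self.upper:
            self.high = True                 # trigger operates: switch to /s/ multiplier
        return "s" if self.high else "f"
```

Because only one multiplier receives the 6,400 c.p.s. input at a time, only one of the
two products can be non-zero, as the text points out.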

As  far  as  the  recognition  of  /  z/  is  concerned,  it  was  soon  realised  that  the  acoustic 
features  of  this  phoneme  do  not  necessarily  consist  of  the  conventional  concept  of  a  hiss 
modulated  by  vocal  cord  tone  but  is  often  distinguished  by  being  shorter  and  weaker  than  its 
unvoiced  counterpart  /s/.  The  best  way  of  differentiating  it  acoustically  from  /s/  therefore 
is  probably  not  by  searching  for  energy  concentration  in  the  low  frequency  part  of  the  spec- 
trun  but  by  looking  instead  for  the  acoustic  expression  of  the  fortis-lenis  opposition. 
Previous  work  specifically  concerned  with  the  acoustic  cues  for  the  /s/  -  /z/  distinction 
(5)  has  shown  that  the  duration  and  the  intensity  of  the  hiss  play  a  part  in  making  this 
distinction. Although the results referred to also indicate that it is the relation between
the duration and intensity of the hiss and those of the preceding vowel sound that serves as a
cue for recognition, it was decided to try /z/ recognition by measuring the absolute value
of the hiss associated with the speaking of the /z/ phoneme. A similar circuit to that already
tried for the recognition of /t/ was used: a relay operated by a one-shot multivibrator
switched the output of the 6,400 c.p.s. filter either to the /z/ or to the /s/ multiplier
circuit, depending on whether the duration of the hiss was smaller or greater than
the period of the multivibrator. The duration of the multivibrator cycle was about 130 msec.
The duration of the hiss then helped to distinguish between three phonemes: /s/, /z/
and /t/. If the duration of the friction was 60 msec or less then a /t/ was recognised, if
it was between 60 and 130 msec a /z/, and if it was more than 130 msec an /s/ was indicated.
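
The three-way decision on friction duration can be restated as a short sketch. The function name is illustrative and not from the original; the thresholds are those quoted above, and the treatment of a duration of exactly 130 msec (assigned here to /z/) is an assumption, since the text does not specify it.

```python
def classify_by_friction_duration(duration_ms: float) -> str:
    """Return the phoneme suggested by the duration of the hiss alone.

    Thresholds follow the text: 60 msec or less gives /t/, more than
    130 msec gives /s/, anything in between gives /z/.
    Assumption: exactly 130 msec is treated as /z/.
    """
    if duration_ms <= 60:
        return "t"
    if duration_ms <= 130:
        return "z"
    return "s"


if __name__ == "__main__":
    for d in (40, 100, 200):
        print(d, "msec ->", "/" + classify_by_friction_duration(d) + "/")
```

In the machine itself this decision was made by relays and one-shot multivibrators rather than by comparison of numbers; the sketch only mirrors the resulting decision rule.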

The circuits for acoustic recognition could have been improved in a number of different
ways: a better-designed bank of filters, more inputs to the multipliers to allow for more
complex spectral reference patterns, the use of a certain amount of memory to detect relative
values of the duration and intensity of successive segments and to detect formant changes
(for the recognition of transitions), etc. It has already been indicated earlier, however,
that there was no intention of pursuing the question of acoustic recognition further than
was necessary to achieve reasonable success so that the effects of using linguistic information
could be investigated. The acoustic recogniser was therefore not developed beyond the
circuits  already  described  and  the  following  paragraphs  will  deal  with  the  circuits  which 
implement  the  linguistic  part  of  the  recogniser. 

THE STORAGE AND USE OF LINGUISTIC INFORMATION

The linguistic information that was to be used in the recognition process consisted
of the digram frequencies of the phonemes; at any point in the recognition process therefore
information had to be available about the probability of occurrence of the various
phonemes in the repertory of the machine as a function of the preceding phoneme. This
meant that the digram frequency of all possible phoneme combinations had to be stored in
the machine: if there were n phonemes then there would be n² digram frequencies to store.

As the recognition proceeded phoneme by phoneme the previously recognised phoneme had always
to be remembered and in the light of its identity the linguistic store had to provide
a different set of n digram frequencies for use in the recognition process. This means
that two separate memories were needed for using the linguistic information: a permanent
one which remembered all n² digram frequencies and a continuously changing one which
remembered the identity of a phoneme recognised for the duration of the next phoneme only
and then remembered the identity of this fresh phoneme for the duration of the one after
that and so on; this second memory determined which set of n digram frequencies appeared
at the output of the linguistic store.

The digram frequencies are remembered in the form of the setting of potentiometer
sliders: when a fixed voltage is applied across the potentiometer the slider provides a
voltage proportional to the digram frequency. The complete store consists of n² potentiometers,
one for each of the n² different digram frequencies to be remembered. The potentiometers
are arranged in n columns of n potentiometers each. The ends of the potentiometers
in any one column are connected together and the sliders are set proportional
to the digram frequencies of the n different phonemes following one particular phoneme.

When a voltage is applied to the commoned ends of a column of potentiometers, the sliders
will provide voltages that are proportional to the digram frequencies. There are n
similar columns, one for each phoneme or rather one for each set of digram frequencies
following these phonemes. Part of such a matrix of potentiometers is shown in Fig. 21.

If, for instance, the previous phoneme recognised has been /m/, then a fixed voltage is
applied to the /m/ column and the sliders provide a set of voltages proportional to the


Fig.  21.  A  typical  section  of  the  "store  of  linguistic 
knowledge"  circuit. 

digram frequencies of the various phonemes following /m/. These voltages are then led by
the common rails, shown horizontally, to the recogniser circuits and are used in ways to
be described later. The diodes, shown in the circuit diagram, prevent the loading of the
output voltages by the potentiometers not energised: all these voltages are positive so
that current can flow from the potentiometers to the horizontal rails but not in the
opposite direction. The actual memory consisted of 256 potentiometers assembled into a
16 by 16 matrix which allowed memory space for remembering the digram frequencies of as
many as 16 phonemes, although the full capacity was never used. The resistance of each
potentiometer was 10 K ohm and the slider was adjusted by a screwdriver. Two rotary switches
and a voltmeter were also provided on the same panel as the potentiometers. One of the
switches, marked “input”, could permanently switch a fixed D.C. voltage to any column of
potentiometers, overriding the control by the maximum detector and relay memory. The
other switch, marked “output”, could connect the voltmeter to any of the horizontal
rails of the potentiometer matrix which are in turn connected to the sliders. This
arrangement made it possible to adjust the setting of each slider at leisure and
could also be used to operate the recogniser without using linguistic information by
permanently connecting the fixed D.C. voltage to one of the columns and setting all
the sliders in that column to the same position.
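
The behaviour of the potentiometer matrix can be illustrated by a small sketch in which each column is a table of slider settings and energising a column yields one voltage per horizontal rail. Everything specific here is assumed for illustration: the digram counts are invented, the value of the fixed voltage is not given in the text, and scaling each column to its largest count is only one plausible way of choosing slider positions.

```python
FIXED_VOLTAGE = 10.0  # assumed value; the text does not state the fixed D.C. voltage


def build_store(digram_counts):
    """Turn digram counts into slider settings between 0 and 1, column by column.

    digram_counts[previous][following] is a hypothetical count of how often
    'following' occurs after 'previous'. Scaling to the column maximum is an
    assumption made for this sketch.
    """
    store = {}
    for previous, column in digram_counts.items():
        peak = max(column.values()) or 1
        store[previous] = {nxt: count / peak for nxt, count in column.items()}
    return store


def rail_voltages(store, previous):
    """Voltages on the horizontal rails when the column for 'previous' is energised."""
    return {nxt: FIXED_VOLTAGE * setting for nxt, setting in store[previous].items()}


# Invented counts for a three-phoneme illustration.
counts = {
    "m": {"a:": 30, "i:": 60, "s": 3},
    "s": {"a:": 20, "i:": 40, "s": 0},
}
store = build_store(counts)
after_m = rail_voltages(store, "m")

# Equal slider settings in one column reproduce operation without linguistic information.
store["none"] = {nxt: 1.0 for nxt in counts["m"]}
flat = rail_voltages(store, "none")
```

The column labelled "none" corresponds to the use of the “input” switch described above, where one column with all sliders at the same position is permanently energised.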

This store of information about digram frequencies has to be operated by another
memory which remembers the identity of the previous recognition. It is this “phoneme
memory” that provides the voltage which is applied to the appropriate column of
potentiometers in the memory of digram frequencies. The phoneme memory is a short-term
one and has to operate in conjunction with the maximum detector already mentioned.
It  must  memorise  the  identity  of  whichever  phoneme  is  selected  by  the  maximum  detector, 
produce  this  information  as  soon  as  the  maximum  detector  has  changed  over  to  another 
phoneme  and  then  continue  to  produce  this  information  until  the  maximum  detector  makes 
a  new  selection.  At  any  one  time,  therefore,  the  phoneme  memory  must  be  capable  of 
storing  information  about  the  phoneme  recognition  being  made  currently,  but  give  no 
corresponding  output,  and  also  give  an  output  indicating  the  identity  of  the  previous 
phoneme recognition. The changeover, when the existing output is to be discarded to
be replaced by information about the phoneme which has been stored, takes place whenever
the maximum detector changes from one selection to another.


The circuit developed for this purpose consists of a number of identical units,
one unit for each phoneme being dealt with by the recogniser, and each unit is associated
with a different one of the relays in the anodes of the maximum detector. The
circuit diagram of one of these memory units is shown in Fig. 22.

Fig. 22. Circuit diagram of a typical phoneme memory unit.

In the quiescent state all the relays are de-energised. Relay A is the relay of the maximum
detector circuit. When it energises, indicating that the maximum detector has
selected that particular phoneme, the a contacts operate relay B which is self-holding
through the b contacts. The operation of the B relay constitutes the
memorising action which will maintain itself even after A releases. When relay A does
release, indicating that a fresh maximum has appeared and has been selected by the
operation of another A relay in the maximum detector, then the a contacts return to
position 1 and the 50 volt H.T. supply is applied to the line marked “typewriter output”
through position 3 of the b contacts and position 1 of the a contacts. This
will operate the typewriter and a character is printed identifying the phoneme recognised;
this means that the typewriter indicates a recognition at the moment when the
actual recogniser has started the next recognition. The 50 volt supply also operates
relay D, the contacts of which apply the fixed voltage to the appropriate column of
potentiometers in the digram frequency store. Finally the 50 volt supply, on being
applied to the “typewriter output” line, will also trigger the one-shot multivibrator
circuit of valves V1 and V2, and relay C operates for the duration of the cycle of
this multivibrator. Position 2 of the c contacts of all the memory units are commoned
to the line marked “cancel”; although this means that the points X on all the units
are connected in parallel, the 50 volts supplied by the a or b contacts of any one
unit will still only affect the B relay of their own unit because of the diode R1.

When  the  c  contacts  of  one  of  the  units  change  over  from  position  1  to  3,  the  “cancel” 
line  is  shorted  to  earth  and  the  B  relay  of  any  unit  that  might  have  been  energised  is 
released,  except  for  the  one  in  which  the  C  relay  has  operated  because  in  that  unit 
the  shorting  action  of  the  c  contacts  has  at  the  same  time  disconnected  that  B  relay 
from  the  cancel  line.  The  diode  is  biased  in  the  forward  direction  now  and  will 
not  impede  the  shorting  action.  In  this  way  the  memory  of  all  the  units  is  cleared, 
except  the  one  that  has  been  operated  immediately  before.  The  multivibrator  and 
relay  C  will  operate  for  only  a  short  time  when  relay  A  releases;  relay  D  however 
will  remain  operated  as  long  as  relay  B  is  energised.  Relay  B  remains  energised  until 
the  A  relay  in  another  unit  releases.  This  will  operate  the  C  relay  of  that  unit  and 
the  shorting  action  of  its  contacts  will  cancel  the  B  relay  of  the  first  memory  unit. 
Once  the  coil  of  the  B  relay  is  shorted  its  contacts  release  and  the  first  unit  is 
returned  to  its  quiescent  state  and  is  ready  for  another  operation.  Summarising 
briefly,  then,  the  A  relay  activates  the  memory  unit  and  the  B  relay  retains  the 
memory  without,  at  this  time,  giving  an  output.  When  the  A  relay  releases,  the  C 
relay  cancels  all  previous  memories  but  retains  its  own,  and  the  D  relay  activates 
the  relevant  part  of  the  memory  of  digram  frequencies. 
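
The logical behaviour of the phoneme memory, as distinct from its relay implementation, can be sketched as a small state machine. The class and method names are illustrative only, and the sketch deliberately ignores the 35 msec period during which the “cancel” line is shorted; it merely shows that the identity of a phoneme is held silently while it is being recognised and is put out (to the typewriter and to the digram store) only when the maximum detector moves on.

```python
class PhonemeMemory:
    """Logical model of the bank of memory units (names are illustrative)."""

    def __init__(self):
        self.current = None   # held by a B relay: stored, but no output yet
        self.previous = None  # put out by a D relay: selects the digram column
        self.typed = []       # characters sent to the "typewriter output" line

    def detector_selects(self, phoneme):
        """Call whenever the maximum detector reports its present selection."""
        if phoneme == self.current:
            return  # no change-over, nothing happens
        if self.current is not None:
            # The A relay of the old unit releases: its phoneme is typed,
            # becomes the "previous phoneme" output, and older memories
            # are cancelled.
            self.typed.append(self.current)
            self.previous = self.current
        self.current = phoneme


memory = PhonemeMemory()
for selection in ["m", "m", "i:", "s"]:
    memory.detector_selects(selection)
```

After this sequence the memory has typed /m/ and /i:/, the digram column for /i:/ is energised, and /s/ is stored awaiting the next change-over.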

All  the  relays  in  these  circuits  are  of  the  high-speed  type  that  operate  and 
release  in  less  than  2  msecs,  and  the  speed  of  operation  of  the  circuit  is  controlled 
by  the  duration  of  the  multivibrator  cycle  which  is  purposely  made  as  long  as  about 
35  msec.  This  seems  a  long  period  but  it  must  be  remembered  that  at  the  rate  of 
speaking  used  in  the  speech  input  to  be  recognised  the  average  duration  of  phonemes 
is  about  300  msec,  and  the  duration  of  all  but  a  few  phonemes,  plosives  for  example, 
is  more  than  170  msec.  Most  of  the  recogniser  circuitry  is  rendered  inoperative 
during  the  time  that  any  one  of  the  C  relays  is  energised  because  the  c  contacts  short 
the common “cancel” line to earth and prevent the operation of another B relay during
this  interval.  This  period  of  immobility  is  needed  to  prevent  undesired  operation  of 
the  recogniser  circuits  owing  to  a  variety  of  reasons.  For  instance,  immediately 
after  the  change-over  some  time  is  needed,  because  of  relay  and  multiplier  circuit 
time  constants,  before  the  fresh  information  about  digram  frequencies  is  fully 
effective  and  the  wrong  maximum  might  be  selected  during  this  interregnum.  Also  as 
is  well-known,  formant  transitions  take  place  at  the  phoneme  boundaries  which  might 
give  rise  to  spectral  patterns  of  short  duration  that  match  quite  different  spectral 
reference patterns from those matched best by the succeeding steady-state segment.

The  nature  of  these  transitions  provides,  of  course,  valuable  cues  for  recognition 
but  they  are  not  used  in  the  system  of  recognition  discussed  here  and  therefore  it  is 
preferable  to  exclude  their  effects.  These  and  other  considerations  made  it  necessary 
to set the time constant of the multivibrator to 35 msec, and thereby to exclude
recognition by means of pattern matching of any event with a shorter duration than
this period.


The foregoing paragraphs have explained how the information about digram frequencies
was stored and made available for use at the right time. The information thus
provided was applied to the recognition process by a further stage of multiplication.
The multipliers of the acoustic recogniser provide a set of voltages, one voltage for
each  phoneme,  showing  how  well  the  input  wave  corresponds  with  the  phonemic  reference 
patterns  or  in  other  words  showing  the  relative  likelihood  of  occurrence  of  the 
phonemes  from  the  acoustic  point  of  view.  At  the  same  time,  another  set  of  voltages 
is  also  available  from  the  store  of  linguistic  knowledge,  again  one  voltage  for  each 
phoneme,  which  are  an  expression  of  the  likelihood  of  occurrence  of  the  various 
phonemes from the linguistic point of view. These two streams of information are combined
by multiplying separately for each phoneme the acoustically derived voltage with
the corresponding voltage from the linguistic store and then selecting the largest
product. This means that for each phoneme two multiplications are carried out prior
to maximum selection. The two filter outputs are multiplied, as explained previously,
for acoustic recognition and the product is then multiplied with the voltage representing
the appropriate digram frequency and this second product is applied to the maximum
detector.  In  constructing  the  recogniser  two  identical  multiplier  circuits  were 
mounted  on  a  common  sub-chassis  to  take  care  of  this  double  multiplication  process 
for  one  phoneme  and  there  were  as  many  of  these  double  multiplier  circuits  as  there 
were  phonemes  to  be  recognised  by  pattern  matching.  A  schematic  diagram  of  the 
arrangement for the complete recogniser is shown in Fig. 23.

Fig. 23. Diagram showing the arrangement by which acoustic and
linguistic information are combined in the recogniser.

Each of the boxes marked “multiplier” multiplies three voltages, two from the filters
and one from the “store of linguistic knowledge”. All the multipliers work simultaneously
and the “maximum detector” identifies that “multiplier” which provides the largest
product. The output of the maximum detector operates the phoneme memory and
the typewriter (not shown in Fig. 23).
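
The double multiplication followed by maximum selection may be written out as a brief sketch. The numerical values are invented for illustration and the function name is not from the original; only the structure (two filter outputs and one linguistic voltage multiplied per phoneme, largest product selected) follows the description above.

```python
def select_phoneme(filter_outputs, linguistic):
    """Pick the phoneme whose three-way product is largest.

    filter_outputs[p] is the pair of filter voltages for phoneme p;
    linguistic[p] is the voltage from the store of linguistic knowledge.
    """
    products = {
        p: f1 * f2 * linguistic[p] for p, (f1, f2) in filter_outputs.items()
    }
    return max(products, key=products.get)


# Invented values: acoustically /z/ scores slightly higher than /s/ ...
filters = {"s": (0.8, 0.75), "z": (0.7, 1.0), "t": (0.5, 0.4)}
# ... but after the assumed preceding phoneme /s/ is far more frequent.
digram = {"s": 0.9, "z": 0.3, "t": 0.5}
uniform = {"s": 1.0, "z": 1.0, "t": 1.0}

with_language = select_phoneme(filters, digram)      # linguistic weighting applied
acoustic_only = select_phoneme(filters, uniform)     # all sliders set alike
```

With these figures the acoustic evidence alone would select /z/, whereas the weighted products select /s/, which is the kind of correction the linguistic store was intended to provide.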

THE OPERATION OF THE TYPEWRITER AND THE TYPEWRITER MEMORY

Next,  the  arrangement  by  which  the  appropriate  key  of  the  typewriter  is 
operated  will  be  described.  An  Underwood  electric  typewriter  was  used  with 
the  recogniser.  Unfortunately  no  electrical  typewriter  exists  in  which  the 
keys  are  operated  by  electrical  action:  they  are  all  mechanically  operated 
and are “electrical” only in the sense that the initial movement, which must
be produced mechanically, triggers off the movement of the type bar, this
latter movement deriving its energy from an electrically driven roller. A
set of solenoids had therefore to be added to the typewriter to provide the
initial movement of the individual keys. The solenoids were also supplied by
Underwoods. It was recommended that the solenoids should be mounted underneath
the keyboard and arranged to pull the keys downwards. In the end it
was found more convenient to mount the solenoids above the keyboard and
arrange that the plungers of the solenoids push the keys downwards. This
avoided the need for hooking each plunger to its appropriate key-bar as would
have been necessary if the solenoids had been mounted underneath the keyboard.
The plungers are returned to their normal position by the return
springs of the typewriter keys. These springs had to be reinforced by phosphor-bronze
sheet arranged to press the keys upwards. A fresh set of typewriter
key tops was fixed to the tops of the solenoid plungers so that the typewriter
can be operated by hand as well as by actuating the solenoids.

The solenoids needed a fairly high current if they were to operate sufficiently
fast; at the same time the total energy consumption had to be
limited to avoid overheating. Both requirements could be satisfied by operating
the solenoids with a measured amount of energy from a condenser charged
to a high voltage. The contacts of the maximum detector relays connected a
64 µF condenser charged to 300 volts across the solenoid; when the relay de-energised
the condenser was recharged by being connected to a 300 volt D.C.
line. The amount of energy stored in the 64 µF condenser was sufficient to
move the plunger of the solenoids and to operate the typewriter. The resistance
of the solenoids was 100 ohms and the peak current flowing was therefore
around 3 amps. Unfortunately this was too high for the contacts of the
high speed relays that are used in the maximum detector; no relay with contacts
having a sufficiently high current rating and at the same time operating
at the high speeds required could be found and therefore the solenoids
were operated by a suitably triggered thyratron. The thyratron circuit will
be  described  later,  after  another  modification  of  the  circuitry  for  operating 
the  typewriter  has  been  discussed. 
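
The figures quoted for the condenser discharge can be checked with a few lines of arithmetic. The sketch treats the solenoid as a pure 100 ohm resistance, which is an assumption: the inductance of the winding, which the text does not give, would lower the true peak current and lengthen the pulse.

```python
def stored_energy_joules(capacitance_f: float, voltage_v: float) -> float:
    """Energy held in a charged condenser: E = C * V^2 / 2."""
    return 0.5 * capacitance_f * voltage_v ** 2


def peak_current_amps(voltage_v: float, resistance_ohm: float) -> float:
    """Initial discharge current into a purely resistive load: I = V / R."""
    return voltage_v / resistance_ohm


def time_constant_ms(resistance_ohm: float, capacitance_f: float) -> float:
    """RC time constant of the discharge, in milliseconds."""
    return resistance_ohm * capacitance_f * 1000.0


C = 64e-6   # 64 microfarad condenser (from the text)
V = 300.0   # charging voltage in volts (from the text)
R = 100.0   # solenoid resistance in ohms (from the text)

energy = stored_energy_joules(C, V)    # about 2.9 joules per operation
current = peak_current_amps(V, R)      # 3 amps, as stated in the text
tau = time_constant_ms(R, C)           # about 6.4 msec
```

On these assumptions each key stroke draws a little under 3 joules, and the current has largely died away within some tens of milliseconds.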

The  maximum  rate  at  which  the  typewriter  can  type  is  about  15  characters 
per  second.  The  average  rate  at  which  phonemes  succeed  each  other  in  the 
speech  material  used  for  testing  the  recogniser  was  only  about  one  per  second 
if  the  interval  between  words  is  taken  as  part  of  the  speech  and  still  only 
about  9  per  second  during  the  words  themselves.  Despite  these  low  average 
rates,  the  peak  rate  was  considerably  higher  and  when  watching  the  typewriter 
during  the  operation  of  the  recogniser  it  became  clear  that  it  failed  to 
type some of the phonemes selected by the recogniser: on some occasions the
solenoid plungers moved but not the type, on other occasions several type
bars operated in quick succession, became entangled and jammed the typewriter.
The peak rate at which phonemes were being recognised, which was later found
to be about 20 to 25 per second, was therefore too high for the typewriter,
whilst the average rate was well within its capabilities. It was thought
possible therefore to overcome this difficulty by using an “information rate
smoother”, a device which accepts and stores the phoneme recognitions at
whatever rate they occur and then “reads” them into the typewriter at a constant,
slower rate. This rate must be less than the maximum of which the
typewriter  is  capable  and  as  long  as  the  output  rate  is  higher  than  the 
average  rate  of  phoneme  recognition  the  system  will  work  satisfactorily  with 
quite  modest  storage  requirements. 

A  storage  system  based  on  magnetic  tape  was  first  tried  but  was  not  a 
success  because  of  the  difficulty  of  controlling  the  movement  of  the  tape  in 
a  satisfactory  manner.  Next  a  system  in  which  a  bank  of  condensers  was  used 
as  the  information  store  was  tried  and  was  found  to  be  satisfactory.  In  this 
system,  each  phoneme  was  designated  by  a  binary  number  and  as  all  44  keys  of 
the  typewriter  were  to  be  catered  for  a  six-digit  binary  code  was  used. 
Whenever a phoneme was recognised the corresponding binary number was produced
by means of a simple diode coding network shown in Fig. 24. A diode


Fig. 24. Example of coding network.

The  50  volt  supply  is  obtained  from  the 
"typewriter  output"  line  of  the  phoneme 
memory  (fig.  22.)  and  the  single  switch 
on  this  diagram  represents  the 
combination  of  the  contacts  of  relay  A 
(off)  and  relay  B  (on)  of  fig.  22. 


was  connected  in  whichever  position  a  binary  one  was  needed  and  the  diode  was 
omitted  for  a  binary  zero.  There  were,  of  course,  as  many  different  coding 
panels  as  there  were  symbols  to  be  typed.  The  binary  number  corresponding  to 
each  phoneme  recognition  was  stored  in  the  form  of  unit  charges  on  a  number 
of  condensers.  As  a  six-digit  code  was  used,  six  condensers  were  needed  to 
store  one  number.  Altogether  25  rows  of  6  condensers  each  were  provided  so 
that  25  successive  binary  numbers  could  be  stored.  One  end  of  all  condensers 
was  connected  together  and  earthed.  The  other  ends  were  connected  to  two  25 
position  multi-level  uniselectors  in  such  a  way  that  any  one  condenser  was 
connected to the two corresponding positions of the two uniselectors. The six condensers
storing one binary number were connected to six different levels of the same
position on the uniselectors, the next six condensers to the six levels of the next
position, and so on. The six wipers of the first uniselector, called the “write”
uniselector, receive the six voltages (or lack of voltage) representing the six digits
of the binary number to be stored and charge the six condensers, in the position to
which the wipers are connected, accordingly. The wipers then step to the next position
and the next binary number is stored and so on. The “write” uniselector is stepped
by the cancelling pulse of the phoneme memory (Fig. 22) so that it steps every time a
new recognition is made. The stepping of the wipers of the second uniselector, called
the “read” uniselector, is determined by the frequency of a multivibrator provided for
this purpose; this frequency was adjustable and was set to approximately 2 c.p.s. A
seventh level on both write and read uniselectors was used to ensure that the wipers
of the read uniselector could never pass those of the write uniselector. Whenever the
“read” wipers get within 2 steps behind the “write” wipers the interlock prevents any
further movement of the “read” wipers until the “write” switch has started moving again.
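
In present-day terms the condenser store with its two uniselectors behaves as a circular buffer with separate write and read positions. The following sketch is a simplified model under stated assumptions: the class name is invented, the 2-step interlock margin is reduced to the rule that reading stops when nothing unread remains, and the behaviour when all 25 positions are full (an error here) is not described in the text.

```python
class RateSmoother:
    """Circular store of six-digit codes with independent write and read wipers."""

    def __init__(self, positions: int = 25):
        self.slots = [0] * positions  # one row of six condensers per position
        self.write_pos = 0
        self.read_pos = 0
        self.unread = 0

    def write(self, code: int) -> None:
        """Store one code and step the "write" wipers (on each new recognition)."""
        if self.unread == len(self.slots):
            raise OverflowError("store full")  # assumed; not covered by the text
        self.slots[self.write_pos] = code & 0b111111  # keep six binary digits
        self.write_pos = (self.write_pos + 1) % len(self.slots)
        self.unread += 1

    def read(self):
        """Deliver the oldest unread code, or None if the interlock holds."""
        if self.unread == 0:
            return None  # "read" wipers may not pass the "write" wipers
        code = self.slots[self.read_pos]
        self.read_pos = (self.read_pos + 1) % len(self.slots)
        self.unread -= 1
        return code


smoother = RateSmoother()
for code in (0b000101, 0b010010, 0b111000):  # a burst of three recognitions
    smoother.write(code)
delivered = [smoother.read() for _ in range(4)]  # read out at the slower rate
```

The burst is written at whatever rate it arrives and is delivered in the original order; the fourth read returns nothing because the “read” position has caught up.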
A simplified form of the circuit described so far is shown in Fig. 25.


Fig. 25. Circuit diagram of “write” and “read” uniselector switch connections.


It will be seen that transistors are used rather than valves. This achieved the well-known
savings in space required, heat dissipated, wiring needed and in the size and complexity of the
power pack; the rest of the recogniser circuitry used valves because transistors
were not yet freely available at the time of its design. In the circuit of Fig. 25,
the negative-going pulse on the “cancel” line of the phoneme memory is used to drive
the “write” uniselector. The pulse drives relay 7, the contacts of which energise
the coils of the uniselector switch and of relay 8, which in turn operates relays 1
to 6. The pulse of the “cancel” line lasts 35 msec and appears every time the
phoneme memory indicates a fresh phoneme recognition. The six voltages representing
the binary number which stands for the phoneme just recognised appear simultaneously
with the beginning of the “cancel” pulse. They will charge the 6 condensers through
the contacts of relays 1 to 6 and the wipers of the uniselector. The time constant
of the condenser charge and discharge circuit is only a few microseconds, so that any
previous charge on any of the condensers will disappear and the new charges will
establish themselves in a relatively short time. At the end of the 35 msec
“cancel” pulse, relay 7 releases and its contacts de-energise the coils of relay 8 and
of the uniselector. The de-energising of the uniselector coil will make the wipers
move forward by one step and the release of relay 8 releases relays 1 to 6,
isolating the condensers, which can hold their charge sufficiently for a matter
of minutes. The circuit driving the coil of the “read” uniselector is shown at the
right-hand end of Fig. 25. A simple, free-running multivibrator is used. It oscillates
at about 2 c.p.s. and its frequency can be adjusted by the 50 K ohm potentiometer.
The contacts of [illegible] open and close at the frequency of the multivibrator
and step the “read” uniselector at this rate. The contacts ensuring that the
“read” switch can never catch up on the “write” switch, as explained earlier, are
also shown, as well as another set of contacts which prevent the “read” uniselector
[illegible] the typewriter during the carriage return period and make it [illegible].

The output voltages from the wipers of the “read” uniselector are decoded, that
is the correct typewriter solenoid to be operated is selected, by routing the solenoid
operating current along the branches of a relay tree whose settings are determined by
the condenser charges. A simplified circuit diagram is shown in Fig. 26.


Fig. 26. Diagram of circuit for operating typewriter solenoids from binary coded control input voltage.
(Legible figure labels: input from “read” uniselector; to thyratron circuit (Fig. 27); to typewriter solenoids; five circuits; all transistors type OC71.)


The wipers of the “read” uniselector are connected to terminals 1 to 6 on the left. A voltage
on, for example, wiper No. 1 will trigger the flip-flop circuit of transistors T1
and T2 and will de-energise relay 1. The same happens to any of the other five similar
flip-flop circuits, if the wipers to which they are connected carry a binary “one”
voltage. The flip-flops remain quiescent if the wipers carry a binary “zero” voltage.
The  operation  of  relays  1  to  6  energises  relays  10  to  60  and  the  contacts  of  these 
atter  relays  form  the  binary  decision  tree  that  connects  the  typewriter  solenoids  to 
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Pr°"?r  °*  ""rgy  f°r  th.  .ol.noid.,  Belays  10  to  60. 

on. M,  „?  it  y‘  "  ClrCU,t’  •"  «*  ">«  high* speed  y.ri.ty  .hioh  o.rty  only 

one  set  of  change-over  contacts  so  that  although  only  a  single  relay  is  shown  fLv 

a  separate  relay  has  to  be  provided  for  each  set  of  contacts.  Summarizing  the  opera- 

tr1S<ters  theCmoUltflon‘CrihedK80  f"*  digit  binary  from  ^uniselector 

Ilf  flip-flops,  which  operate  relays  1  to  6  which  in  turn  operate  relays  10 

to  60  and  the  contacts  of  these  pre-select  one  particular  solenoid  to  which  the  current 
to  be  supplied  by  the  thyratron  will  be  sent.  The  thyratron  circuit,  to  be  describH 
later,  is  triggered  by  the  two  flip-flops  of  transistors  T,  T  T  T  tv,  .  d  . 

"'r*  “ 6  ■*“  -?  ?*»  pH.pr.i.y  s  L4„i5.TSig^inop"i  r of 

through  one  or  more  of  diodes  Dx  to  Dg  to  the  flip-flop  of  T,  and  T*.  RelavVin  the 

cyaeCofrtherflip-flfJ3  1“n°rmally  «H  release  for  the  duration  of  the 

X 1?  t  thC  *  P.P'  The  contacts  of  relay  7  normally  short  the  0.01  uF  condenser 
in the base circuit of T5 to earth. When relay 7 operates, the condenser will charge
to a negative voltage but will not affect T5 because of the shorting diode. At the
end of the duty cycle of flip-flop T3 T4, the condenser is discharged to earth and
sets up a positive pulse across the 47 K ohm resistance in the base circuit of T5;
this triggers the flip-flop of transistors T5 and T6. Relay 8 [illegible] is normally
operated and will release for the duration of the duty cycle of the flip-flop T5 T6.
This, as will be seen below, triggers the thyratron and an energising current is sent
along the relay tree to the appropriate solenoid. The flip-flop T3 T4 operates relay 7
and delays the triggering of flip-flop T5 T6 for a time span of 20 msec. The double
purpose of this interval is to allow plenty of time for the contacts of relays 10 to
60 to select the desired path for the solenoid operating current before relay 8
triggers the thyratron and to ensure that the relay contacts do not themselves switch
the solenoid current.

The contacts of relay 8 remain open for 30 msec., as required by the thyratron circuit
and, as will be explained later, the current from the thyratron will cease not later
than a further 10 msec. The flip-flop T1 T2, which initiated the cycle, keeps relays
1 to 6 and 10 to 60 operated for 200 msec., and therefore at least 140 msec. must
elapse after the solenoid current ceases before the contacts of the relay tree change
again. This means that the contacts of the relay tree neither break nor make
the relatively high solenoid operating current; this feature of the design is
important to prevent early damage to the relay contacts. At the end of the 200 msec.
operating period of relays 1 to 6 the circuit is ready to receive the next input,
derived from the wipers of the "read" uniselector after it has stepped again.
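
The timing argument above can be checked with a few lines of arithmetic. The sketch below is only an illustration: the constant and function names are invented here, and the figures are the ones quoted in the text (200 msec. hold, 20 msec. delay, 30 msec. relay 8 window, 10 msec. thyratron tail).

```python
# Timing budget for one typing cycle, in msec, using the figures quoted in the text.
# All names below are illustrative and do not appear in the original report.
HOLD_T1_T2 = 200      # relays 1-6 and 10-60 held operated by flip-flop T1 T2
DELAY_T3_T4 = 20      # delay before flip-flop T5 T6 is triggered
RELAY8_OPEN = 30      # time for which the contacts of relay 8 stay open
THYRATRON_TAIL = 10   # current ceases not later than this after the relay 8 window


def settle_margin(hold=HOLD_T1_T2, delay=DELAY_T3_T4,
                  window=RELAY8_OPEN, tail=THYRATRON_TAIL):
    """Shortest time between the end of the solenoid current and the
    moment the relay-tree contacts are allowed to change again."""
    latest_current_end = delay + window + tail
    return hold - latest_current_end


if __name__ == "__main__":
    print("margin after solenoid current ceases:", settle_margin(), "msec")
```

With the quoted figures the latest possible end of the solenoid current is 60 msec. after the start of the cycle, which leaves the 140 msec. margin stated in the text.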


The current for operating the solenoids is switched on and off by a suitably
triggered  thyratron.  As  already  mentioned,  the  solenoids  need  an  operating  current 
of  about  2  amps  but  this  value  of  current  must  not  be  maintained  for  longer  than 
about  10  msec.,  to  prevent  over-heating.  It  was  not  found  easy  to  obtain  a  mechanical 
switch  that  combined  the  fast  operation  with  the  necessary  current-carrying  capacity 
and  that  is  why  a  thyratron  was  used:  the  thyratron  was  triggered  to  connect  the 
solenoids  to  the  mains  supply  for  one  half-cycle  of  the  mains  voltage.  The  necessary 
circuit  is  shown  in  Fig.  27.  The  solenoid  selected  by  the  relay  tree  is  connected  to 
the  mains  voltage  through  the  thyratron;  normally  the  solenoid  is  isolated  but  when 
the  thyratron  fires  the  whole  of  the  mains  voltage,  less  the  relatively  small  thyratron 
maintaining  voltage,  is  connected  across  the  solenoid  and  a  peak  current  of  2  to  3 
amps will be obtained. The grid voltage of the thyratron is obtained by combining a
D.C. voltage with a 50 c.p.s. A.C. voltage. The phase of the A.C. grid voltage leads
the anode voltage by about 22°; the phase shift is obtained by the 0.025 µF condenser
and 300 K ohm resistance. As a result of this phase shift, the grid voltage will
reach its most positive value before the middle of the positive half-cycle of the
anode voltage. The D.C. voltage, with negative polarity of course, is obtained from
the rectifier R1 and the 8 µF reservoir condenser. In the normal, inoperative
condition, the contacts of relay 8 in Fig. 26 short out one of the 62 K ohm resistors,
as shown in Fig. 27. Under these conditions the D.C. voltage at the grid of the
thyratron is about -33 volts with an A.C. voltage of 15 volts peak value added to it.
The triggering voltage of the thyratron, for the peak anode voltage, is around -5 volts
and therefore the thyratron cannot strike. When relay 8 of Fig. 26 releases and remains
open for 30 msec., the negative D.C. voltage at the grid falls to about 20 volts and
the instantaneous voltage at the grid will reach around -5 volts during the positive
peaks of its A.C. component. As a result the thyratron will fire when the anode
voltage is near its peak value. The exact conditions of firing will depend on the
instant of operation of relay 8 (Fig. 26) in the A.C. cycle. Two things must be
remembered in this connection. First, the duration for which the contacts of relay 8
are open is 30 msec., that is 1½ cycles of the A.C. voltage, and second, the thyratron
extinguishes during the negative half-cycle of the anode voltage and cannot fire in
two successive positive half-cycles of the anode voltage because of the action of
relay B. Relay B will operate as soon as the thyratron fires but will not release for
about 30 or 40 msec., because the 2 µF condenser can charge rapidly through the
rectifier R2 but must discharge slowly through the high resistance coil of relay B. The
contacts of B will short the 62 K ohm resistance and raise the negative grid voltage
again. This will not affect the thyratron if it has already fired but will prevent a
second firing during the next positive half-cycle, even if the contacts of relay 8
(Fig. 26) are still open. During normal operation, then, if the contacts of relay 8
open during the first half of the positive half-cycle of the anode voltage then the
thyratron will fire, then extinguish at the end of the positive half-cycle and will
not fire again during the next positive half-cycle, even though the contacts of relay
8, which remain open for 1½ cycles, will probably still be open. If the contacts of
relay 8 open during the second half of the positive half-cycle of the anode voltage,
the thyratron will not fire because the A.C. component will have made the grid voltage
too negative. The thyratron will then fire during the second positive half-cycle
of the anode voltage because the contacts of relay 8 remain open for 30 msec. If the
contacts of relay 8 first open during the negative half-cycle of the anode voltage,
then the thyratron will fire during the next positive half-cycle. In this way the
solenoids receive a current pulse which is no shorter than ¼ cycle and no longer
than ½ cycle whatever the instant at which the contacts of relay 8 first open. The
circuit in the lower part of the diagram shows the operation of the thermal delay
switch DS which ensures that current cannot be sent through the thyratron until its
cathode is fully heated. The function of self-holding relay A is to disconnect the
thermal delay switch as soon as it has operated. This ensures that the thyratron
cannot be switched on again without awaiting the full delay period if the supply
voltage is switched on again soon after it has been switched off.

Fig. 27. Circuit diagram of thyratron unit for energising
the typewriter solenoids.
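
The figures quoted for the thyratron grid circuit can be cross-checked numerically. The sketch below is hedged: it assumes that the 0.025 µF condenser and the 300 K ohm resistance form a simple unloaded series C-R lead network (the actual circuit of Fig. 27 may load it differently), and it takes the critical grid voltage as -5 volts. The function names are invented for this illustration.

```python
import math

MAINS_HZ = 50.0  # mains frequency, c.p.s.


def cr_phase_lead_deg(c_farads, r_ohms, freq_hz=MAINS_HZ):
    """Phase lead of the voltage across R in an unloaded series C-R network
    (assumed topology; the report only names the two components)."""
    return math.degrees(math.atan(1.0 / (2.0 * math.pi * freq_hz * r_ohms * c_farads)))


def can_fire(dc_bias_v, ac_peak_v=15.0, critical_v=-5.0):
    """True if the most positive instantaneous grid voltage reaches the
    critical (triggering) grid voltage of the thyratron."""
    return dc_bias_v + ac_peak_v >= critical_v


def window_in_cycles(window_ms, freq_hz=MAINS_HZ):
    """Length of the relay 8 window expressed in mains cycles."""
    return window_ms / (1000.0 / freq_hz)


if __name__ == "__main__":
    print("phase lead: %.1f degrees" % cr_phase_lead_deg(0.025e-6, 300e3))
    print("fires with -33 V bias:", can_fire(-33.0))
    print("fires with -20 V bias:", can_fire(-20.0))
    print("30 msec. window in cycles:", window_in_cycles(30.0))
```

Under these assumptions the lead comes out at roughly 23°, close to the "about 22°" quoted; a bias of -33 volts keeps the grid peak at -18 volts, well below the triggering level, while a bias of -20 volts just brings the peak up to it.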


THE  POWER  PACKS 


The  only  parts  of  the  circuitry  that  have  not  yet  been  described  are  the  power 
packs.  The  current  and  voltage  requirements,  other  than  6.3  volt  50  c.p.s.  heater 
supplies,  are  set  out  in  Table  1.  As  a  general  rule,  stabilised  H.T.  supplies  were 
employed  throughout.  They  were  preferred  even  if  voltage  stability  was  not  of  primary 
importance  because  the  stabilisers  provided  low  hum  level  and  low  output  impedance. 

The  low  output  impedance  made  it  possible  to  supply  a  number  of  different  circuits 
from  the  same  H.T.  supply  without  danger  of  undesirable  coupling  through  the  output 
impedance  of  the  power  pack.  The  only  H.T.  supplies  which  were  left  unstabilised 
were  the  +275  volt  supply  for  the  filter  amplifier,  the  -700  volt  supply  for  the 
maximum  detector  (which  was  stabilised  with  a  neon  tube  in  the  maximum  detector  cir¬ 
cuit  itself)  and  the  30  volt  supply  for  energising  the  uniselector  magnets. 


The power pack for the filter amplifier consisted of the conventional vacuum
diode full-wave rectifier, using a reservoir condenser followed by a single
inductance-capacitance smoothing stage. The negative supply for the maximum detector consisted
of  a  vacuum  diode  half-wave  rectifier  and  a  reservoir  condenser.  The  uniselector 
supply  used  a  selenium  diode  bridge  rectifier  and  a  reservoir  condenser. 


The  stabilised  supplies  all  used  the  usual  full  wave  rectifier  -  reservoir  con¬ 
denser  arrangement  to  provide  the  H.T.  voltage.  The  stabilisers  were  of  three  differ¬ 
ent  kinds,  depending  on  the  voltage  required.  The  300  volt  and  200  volt  stabilisers 
used the conventional series valve circuit shown in Fig. 28. Two 12E1 valves are used
in parallel to serve as the series valve V1; this doubles the effective mutual
conductance and improves the performance of the circuit. The neon stabiliser which
provides the reference voltage is supplied from a separate source so that its output
is not affected by variations within the stabiliser and as a result the output
impedance of the circuit is as low as 0.1 ohm or less, from 1 c.p.s. to about 4 kc.p.s.
and is still only 0.3 ohm at 10 kc.p.s. The output impedance rises towards the
higher frequencies because the gain of V2 declines. The mains ripple on the output
is about 250 microvolts peak. The output voltage remains constant to about
[value missing in source] for a ±10% mains voltage variation. The rectifier cathodes
are of the indirectly heated type to ensure that all the valve cathodes in the circuit
are fully heated by the time the rectified voltage appears. This prevents the build-up
of excessive voltages across some of the valve electrodes and across the smoothing
condensers. The adjustable screen grid voltage of V1 helps in setting the output
impedance and hum to a minimum. The circuit can supply up to 200 mA at 300 V, although
only just over 100 mA were actually required.

Fig. 28. Simplified circuit diagram of 300 volt series stabiliser.
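
The benefit of the low output impedance can be illustrated with Ohm's law: a current change drawn by one circuit appears on the common H.T. line as a voltage change equal to the output impedance times that current change. In the sketch below the 10 mA current swing is a purely hypothetical figure chosen for illustration; only the impedance values come from the text.

```python
def coupling_mv(z_out_ohms, delta_i_ma):
    """Voltage change on the shared supply line, in millivolts, caused by a
    current change of delta_i_ma milliamps (ohms x mA = mV)."""
    return z_out_ohms * delta_i_ma


if __name__ == "__main__":
    HYPOTHETICAL_SWING_MA = 10.0  # assumed signal-current swing, not from the report
    for label, z in (("1 c.p.s. to 4 kc.p.s.", 0.1), ("10 kc.p.s.", 0.3)):
        mv = coupling_mv(z, HYPOTHETICAL_SWING_MA)
        print("%s: about %.1f mV of unwanted coupling" % (label, mv))
```

Even at 10 kc.p.s. the assumed swing would disturb the 300 volt line by only a few millivolts, which is why several circuits could share one stabilised supply.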


The  same  type  of  stabiliser  was  used  for  the  200  volt  supply.  Almost  400  mA 
were  required  at  this  voltage  and  therefore  two  separate  stabilisers,  supplying 
about  200  mA  each  were  used. 


The circuit of Fig. 28 is not very suitable for providing low output voltages.
The minimum output voltage cannot be less than the sum of the anode-cathode voltage
required to operate V2 and of the grid-cathode voltage required for V1. In practice
the minimum voltage that can conveniently be provided by the circuit is about 200
volts and a different stabiliser was used for the other, lower supply voltages also
needed for operating the recogniser. The stabiliser used for the 50 volt supply is
shown in Fig. 29; the circuits for obtaining the other low voltages were the same
except that a few component values had to be changed to allow for the different
output voltages and that the positive output terminal was earthed when an output which
is negative relative to earth was required. The circuit of Fig. 29 follows very
closely that described by R. E. Reynolds (46) and is very similar to that of Fig. 28
except that a transistor is used in the amplifier stage instead of a vacuum valve.
The transistor amplifier can provide a gain of several hundred and still only requires
quite a small voltage across its 500 K ohm load resistance. As a result the minimum
output voltage obtainable from the circuit is about 35 volts, which is the grid bias
required by the series valve. The 4.7 ohm dummy load resistance ensures that the
series valve is not cut off under conditions of no load: the larger grid bias required
by the series valve for cut-off would raise the value of the minimum output voltage.
The 0.5 µF condenser between base and collector of the transistor prevents
self-oscillation. The output impedance of the circuit is about 3 ohms and the output
voltage varies by ±0.7% for a ±10% change of the supply voltage.

Fig. 29. Simplified circuit diagram of series stabiliser using transistor
and valve.


At the time when the stabiliser circuits shown in Fig. 29 were designed, it was
necessary to use vacuum valves in the position of the series valve because no
transistor capable of carrying more than about 10 mA was freely available. Soon
afterwards, however, the situation changed and it became possible to design series
stabilisers in which the bulky series valves, which dissipated a large amount of heat,
could be replaced by a transistor. It was decided therefore to replace the existing
power pack for the 30 volt supply, which had to provide 400 mA, with an all-transistor
stabiliser. The circuit which was adopted was based on designs given by Brown and
Stephenson (2) and is shown in Fig. 30. The transistors T1 and T2 are connected in a
long-tail pair circuit in which the reference voltage derived from the Zener diodes
Z1 and Z2 is compared with the output voltage. The amplified voltage difference
controls the series transistor T4. The output of this stabiliser varies by
[value illegible] for a mains voltage change of ±10%, the output impedance is 0.7 ohms
and the ripple on full load is 2 millivolts peak.

Fig. 30. Circuit diagram of all-transistor series stabiliser.


A  photograph  of  the  recogniser  can  be  seen  in  Fig.  31.  The  rack  which  carries 
the  filters  is  on  the  left,  the  multipliers  and  the  maximum  detector  are  on  the  middle 
rack.  The  potentiometer  matrix  of  the  digram  frequency  memory  is  at  the  top  of  the 
right  hand  rack;  the  phoneme  memory  and  the  typewriter  memory  are  at  the  bottom  of 
this  rack.  The  tape  recorder  which  provides  the  speech  input  and  the  typewriter  with 
the  solenoids  can  be  seen  on  the  table  in  front. 
CHAPTER V


THE SPEECH MATERIAL USED FOR TESTING THE RECOGNISER

Once the circuitry that has just been described had been assembled and tested,
suitable speech material was needed to measure the overall performance of the
automatic recogniser. The speech material to be used was, as has already been
mentioned, in the form of a list of words which was recorded on magnetic tape for
repeated use with the recogniser. The words were spoken in isolation, with an
intonation appropriate for a simple statement, and the speaker was asked to speak at
a constant level as judged subjectively by himself. The same word lists were recorded
by three different speakers, although most of the results quoted below were obtained
by using the words spoken by one speaker only.


Several different word lists have been used in the course of the experiments.
The first to be used consisted of 139 words; the words contained only the phonemes
in the repertory of the machine, that is, /t k s ʃ m n l i: ɑ: u: ə:/,
although in the case of the vowels the short /i/ and /ə/ phonemes were also allowed
and were put into the same class as the long /i:/ and /ə:/ respectively. In this
way, for example, the final vowel in the word mercy was considered to be in the same
phonemic class as the vowel in teak and again both vowels in the word murmur were
considered to be the same. The word list, both spelt and in the appropriate phonetic
transcription, is given in Table 2, and will be referred to as List 1.


Table 2

List of 139 words used for testing the automatic recogniser

List 1

tart        tɑ:t          team        ti:m          tomb        tu:m
terse       tə:s          calm        kɑ:m          coot        ku:t
curt        kə:t          seat        si:t          seem        si:m
shark       ʃɑ:k          shoot       ʃu:t          mart        mɑ:t
meek        mi:k          teat        ti:t          teal        ti:l
tool        tu:l          term        tə:m          keel        ki:l
combe       ku:m          curse       kə:s          seek        si:k
seen        si:n          sheet       ʃi:t          shirt       ʃə:t
marsh       mɑ:ʃ          meal        mi:l          teak        ti:k
toot        tu:t          Turk        tə:k          cart        kɑ:t
keen        ki:n          coon        ku:n          curl        kə:l
cease       si:s          suit        su:t          chic        ʃi:k
shirk       ʃə:k          meet        mi:t          mean        mi:n
moot        mu:t          murk        mə:k          leak        li:k
lean        li:n          loom        lu:m          learn       lə:n
soon        su:n          tartan      tɑ:tn         Tina        ti:nə:
Khartoum    kɑ:tu:m       car-seat    kɑ:si:t       keener      ki:nə:
curser      kə:sə:        canoe       kə:nu:        sooner      su:nə:
surcease    sə:si:s       salute      sə:lu:t       shooter     ʃu:tə:
martyr      mɑ:tə:        Marshall    mɑ:ʃl         meeker      mi:kə:
mercer      mə:sə:        lama        lɑ:mə:        leaner      li:nə:
looser      lu:sə:        Lulu        lu:lu:        lurker      lə:kə:
turtle      tə:tl         Carson      kɑ:sn         colonel     kə:nl
circle      sə:kl         Meikle      mi:kl         two-seater  tu:si:tə:
merciless   mə:si:lə:s    moose       mu:s          lark        lɑ:k
lease       li:s          loot        lu:t          lurk        lə:k
tarn        tɑ:n          sheen       ʃi:n          teeter      ti:tə:
turkey      tə:ki:        carter      kɑ:tə:        car-mart    kɑ:mɑ:t
cooler      ku:lə:        curly       kə:li:        seeker      si:kə:
sateen      sə:ti:n       Sassoon     sə:su:n       saloon      sə:lu:n
chi-chi     ʃi:ʃi:        market      mɑ:kə:t       Marner      mɑ:nə:
meaner      mi:nə:        machine     mə:ʃi:n       litre       li:tə:
Luton       lu:tn         loosen      lu:sn         lunar       lu:nə:
learner     lə:nə:        certain     sə:tn         carnal      kɑ:nl
Seaton      si:tn         sermon      sə:mə:n       Merton      mə:tn
Saluki      sə:lu:ki      moon        mu:n          leet        li:t
leash       li:ʃ          loose       lu:s          loon        lu:n
turn        tə:n          tartar      tɑ:tə:        tee-shirt   ti:ʃə:t
cartoon     kɑ:tu:n       carton      kɑ:tn         calmer      kɑ:mə:
cocoon      kə:ku:n       curler      kə:lə:        sea-mark    si:mɑ:k
Circe       sə:si:        salaam      sə:lɑ:m       chenille    ʃə:ni:l
shirker     ʃə:kə:        Marsham     mɑ:ʃm         meter       mi:tə:
mercy       mə:si:        murmur      mə:mə:        Lima        li:mə:
lucre       lu:kə:        Lucerne     lu:sə:n       lateen      lə:ti:n
curtain     kə:tn         castle      kɑ:sl         kirtle      kə:tl
seamen      si:mə:n       Martian     mɑ:ʃn         myrtle      mə:tl
lacuna      lə:ku:nə

Number of words:    139        Number of monosyllabic words:  59
Number of sounds:   526        Number of disyllabic words:    76
Number of digrams:  665        Number of trisyllabic words:    4


The 139 words are made up of 59 monosyllabic words, 76 disyllabic words and 4
trisyllabic words; the total number of phonemes in the list is 526 and there are 665
digrams, that is phoneme transitions, including the transitions from inter-word space
to initial phoneme and from final phoneme to inter-word space. The frequency of
occurrence of the phonemes in this list is given in Table 3 and the digram
frequencies, again as found in this list, are given in Table 4(a).
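
The digram totals follow directly from the definition: a word of n phonemes contributes n - 1 internal transitions plus one transition from the inter-word space and one to it, so the number of digrams equals the number of phonemes plus the number of words (526 + 139 = 665). The sketch below illustrates the counting on a three-word toy list; the `#` marker for the inter-word space, the ASCII stand-ins for the phonetic symbols and the function names are all choices made for this illustration.

```python
from collections import Counter

SPACE = "#"  # stand-in for the inter-word space


def digram_counts(words):
    """Count phoneme transitions, including space->initial and final->space."""
    counts = Counter()
    for phonemes in words:
        seq = [SPACE] + list(phonemes) + [SPACE]
        counts.update(zip(seq, seq[1:]))
    return counts


def row_percentages(counts):
    """Express each digram as a percentage of the occurrences of its first member."""
    totals = Counter()
    for (first, _second), n in counts.items():
        totals[first] += n
    return {pair: 100.0 * n / totals[pair[0]] for pair, n in counts.items()}


if __name__ == "__main__":
    # Toy list: seek, team, mercy ("e:" stands for the central vowel).
    toy = [["s", "i:", "k"], ["t", "i:", "m"], ["m", "e:", "s", "i:"]]
    counts = digram_counts(toy)
    n_phonemes = sum(len(w) for w in toy)
    print("digrams:", sum(counts.values()), "= phonemes + words =", n_phonemes + len(toy))
    print("i: followed by k: %.0f%%" % row_percentages(counts)[("i:", "k")])
```

The same identity accounts for the totals quoted for the other lists (270 + 75 = 345 and 678 + 200 = 878).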


Table 3

Frequency of occurrence of phonemes in List 1,
expressed as percentages of the total number of phonemes.

t      12%          l       7%
k      10%          ɫ       3%
s       8%          i:     10%
ʃ       4%          ɑ:      6%
m       9%          u:      7%
n       8%          ə:     16%


Table 4(a)

Digram frequencies of phonemes in List 1, expressed as a
percentage of the occurrence of each phoneme.

[Table body not recoverable from the source; rows give the first phoneme and
columns the second phoneme.]


Table 4(b)

Voltage settings of potentiometers in store of linguistic
knowledge to represent digram frequencies in Table 4(a).

[Table body not recoverable from the source; rows give the first phoneme and
columns the second phoneme.]


List 1 contains quite a number of proper names like Seaton or Carson and some
unusual words like lacuna or chenille. This was undesirable in some experiments in
which a human reader or listener was asked to interpret the output of the recogniser
and therefore a second list was made up, to be called List 2, which consists of a
selection of the words from List 1, all proper names and many of the unusual words of
List 1 having been omitted. List 2 consisted of 75 words and these are given, in the
randomised order in which they were used in some of the experiments to be described
later, in Table 5. The words were made up of 41 monosyllabic words, 33 disyllabic
words and one trisyllabic word. The 75 words contained altogether 270 phonemes and
345 digrams. List 2 was never recorded separately; instead, the recording of List
1 was used as input and the responses of the recogniser to the words of List 2 were
selected from the output.


Table 5

List of 75 words used for acoustic and for visual
presentation of the recogniser output

List 2

cart        seaman      tool        carton      curly
moon        sermon      car-seat    saloon      teat
tomb        meter       leak        shirt       team
seeker      cooler      curtain     keen        tartan
mercy       seek        learn       castle      tart
shooter     turkey      meek        leash       marsh
loose       loot        lease       loosen      term
curt        seat        martyr      colonel     seem
meal        meeker      salute      cartoon     sheen
suit        Turk        certain     sooner      learner
curse       teak        tartar      calmer      cease
market      machine     keener      circle      mean
meaner      soon        seen        carter      loom
shark       sheet       lean        turn        calm
canoe       merciless   lurk        meet        terse

No. of words:    75        No. of monosyllabic words:  41
No. of sounds:   270       No. of disyllabic words:    33
No. of digrams:  345       No. of trisyllabic words:    1


At  a  later  stage  in  the  experiments  it  was  decided  to  increase  both  the  number 
of  phonemes  in  the  repertory  of  the  machine  and  the  number  of  words  in  the  speech 
material  to  be  used  for  testing.  The  phonemes  /z/  and  /f/  were  added  to  the 
categories  that  the  recogniser  could  deal  with  and  a  new,  longer  word  list  was  made 
up.  The  extended  list  of  words  was  obtained  by  selecting  all  those  words  found  in  a 
dictionary  of  about  60,000  common  English  words  (35)  which  contained  no  phonemes  other 
than the 13 in the repertory of the recogniser; as before, the long and short vowels
/i:/ and /i/, and /ə:/ and /ə/ respectively, were taken to be identical. This produced
a list of just over 500 words which, of course, included many consonant
clusters.  The  words  were  recorded  and  applied  to  the  recogniser.  Unfortunately  it 
was  found  that  the  recognition  of  consonants  in  a  cluster  was  very  poor.  This  could 
have  been  remedied  but  only  after  extensive  re-design  of  the  acoustic  recogniser  and 
this  was  not  considered  worthwhile  at  this  stage  of  the  research.  Just  as  an  example 
of  the  kind  of  difficulty  encountered,  the  phoneme  /t/  was  recognised  by  the 
characteristics  of  the  fricative  aspiration  that  follows  the  release  of  the  stop  in 
the  articulation  of  the  phoneme;  when  a  /s/  follows  a  /t/  in  a  cluster,  as  for  in¬ 
stance  in  the  word  shoots,  the  aspiration  merges  with  the  following  fricative  and 
cannot  be  detected.  Another  example  of  the  difficulties  encountered  is  when  two 
plosives  follow  each  other,  as  in  the  word  asked.  Here  the  /k/  is  often  not  ex¬ 
ploded  and  consequently  only  the  aspiration  of  the  following  /t/  can  be  detected  by 
the  recogniser.  In  view  of  these  difficulties  it  was  decided  to  eliminate  from  the 
list all words that contained consonant clusters. This resulted in List 3 which is
given in Table 6 in the random order in which it was presented to subjects in some of
the experiments to be described later. List 3 contains 200 words, of which 124 are
monosyllabic, 70 disyllabic and 6 trisyllabic. The total number of sounds is 678
and the total number of digrams 878. The frequency of occurrence of the phonemes in
the list is given in Table 7. The values of digram frequencies relevant to the words
of List 3 are not given in this report because in all the experiments which used this
word list, the linguistic store was adjusted to the so-called "pure CVCV" condition.
This means that all CV and VC sequences were given equal probabilities whilst a CC or
VV sequence was made impossible.
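
The "pure CVCV" setting of the linguistic store can be pictured as a matrix in which every consonant-vowel and vowel-consonant transition carries the same weight and every consonant-consonant or vowel-vowel transition carries none. The sketch below makes two assumptions that the text does not state: the common weight is taken as 36 volts (the full-scale value used for List 1), and transitions to and from the inter-word space are left out. The ASCII symbols ("sh" for the palato-alveolar fricative, "e:" for the central vowel) are stand-ins.

```python
CONSONANTS = ["t", "k", "s", "sh", "f", "z", "m", "n", "l"]
VOWELS = ["a:", "i:", "u:", "e:"]


def pure_cvcv_store(full_scale=36.0):
    """Return {(first, second): volts} with equal weight for CV and VC
    transitions and zero for CC and VV transitions."""
    phonemes = CONSONANTS + VOWELS
    store = {}
    for first in phonemes:
        for second in phonemes:
            alternating = (first in VOWELS) != (second in VOWELS)
            store[(first, second)] = full_scale if alternating else 0.0
    return store


if __name__ == "__main__":
    store = pure_cvcv_store()
    allowed = sum(1 for v in store.values() if v > 0)
    print("allowed transitions:", allowed, "of", len(store))
```

With nine consonants and four vowels, 72 of the 169 possible transitions are allowed under this scheme.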


Table 6

List of 200 words

List 3

car         mousse      teaser      litre       alarm
tars        meat        cur         marquees    tartars
looter      lucerne     curs        seek        catarrh
meek        loose       tartan      teasers     afoot
are         aloof       murmurs     neat        earl
fees        farm        cars        turf        myrrh
mars        two         Erse        lose        almoner
Caesar      coot        carcass     looser      meters
lunar       loom        shirkers    ark         feet
tar         cool        mazurka     cocoon      fern
feeler      salute      shoot       loser       occurs
anneal      machine     leak        loofah      tomb
zeal        mart        calf        firmer      litres
Turk        leas        mar         tarlatan    moot
sirs        shoes       sooner      earn        lark
feelers     leash       farmers     murmur      neater
carter      shirker     tool        saloon      tart
cart        soon        niece       art         mark
loot        knee        farmer      shark       mercer
combe       learners    accoutre    shoe        knees
fee         calm        meter       tarsus      seen
marsh       noon        learn       assert      losers
mean        sees        eke         leaf        furs
lama        loon        lucre       psalm       surf
lamas       me          marquee     occur       turn
affirm      laugh       canoe       eel         Zulu
curt        teal        looters     furl        shirt
ooze        almoners    mercers     circus      team
afar        lean        zoo         ease        cartoon
cease       keys        sheet       shirk       meeker
term        sir         Zulus       fur         canoes
lea         lemurs      curse       sheen       martyr
see         teak        feel        eat         seat
lemur       tartar      farce       curl        alert
learner     meal        tarn        lurk        nurse
Shah        kneel       sheaf       martyrs     far
cooler      tease       arm         key         facetious
zoom        coon        moon        khan        lease
tea         salaam      firm        terse       armour
seem        noose       fool        keel        seal


Table 7

Frequency of occurrence of phonemes in word list 2 and word list 3.

              Word List 2                    Word List 3
Phoneme    Number of     Percentage       Number of     Percentage
           occurrences   of total no.     occurrences   of total no.
                         of phonemes                    of phonemes

t              36            13               61             9
k              29            11               51             8
s              26            10               40             6
ʃ               8             3               16             2
m              23             8               54             8
n              25             9               39             6
l              21             8               60             9
f               -             -               30             4
z               -             -               47             7
i:             31            11               63             9
ɑ:             15             6               51             8
u:             16             6               49             7
ə:             40            15              117            17

Total         270           100              678           100


THE TESTING OF THE RECOGNISER

In the first series of experiments the words of List 1 were applied to the
recogniser and the output was recorded by typewriter.

The effect of the linguistic information on the recognitions made by the
machine was observed by reading the complete word list into the recogniser twice. On
the first reading the stored knowledge of digram frequencies was not used and the
output was determined solely by the acoustic recognition circuits; this mode of
operation was obtained by making the probabilities of occurrence of all phonemes
permanently equal and will be referred to in future as the unbiased condition. When
the same word list was read into the machine for the second time, the store of
linguistic knowledge was adjusted to give output voltages proportional to the digram
frequencies, and the recogniser was said to operate in the biased condition. The
potentiometer sliders of the linguistic store were adjusted to correspond to the
digram frequency values given in Table 4(a) by making the output corresponding to
the largest digram frequency in any horizontal row in the table equal to 36 volts
and all other outputs in the same row were made smaller proportionally to the
appropriate digram frequency. For instance the phoneme /u:/ is followed most often,
30% of the time, by the phoneme /n/ and 11% of the time by /m/; therefore the
potentiometer supplying information about the /u:/ to /n/ digram frequency is set to
give an output of 36 volts and the one for the /u:/ to /m/ digram frequency an output
of 13 volts. On the other hand, /i:/ is followed most frequently, 23% of the time,
by /t/ and 15% of the time by /k/; therefore the voltage indicating the /i:/ to /t/
digram frequency is made 36 volts and that for /i:/ to /k/ is made 24 volts, and so
on. Any digram frequency that would have had to be represented by an output of less
than 5 volts - about 4 to 5% in most cases - was made to equal zero on the
potentiometer matrix. The complete set of voltages from the linguistic store is given in
Table 4(b).
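
The scaling rule just described can be written out as a short sketch. The function name and the treatment of the 5 volt threshold as a simple cut-off are assumptions made for illustration; the percentages for /u:/ and /i:/ are the ones quoted in the text, and the 4% entry is a hypothetical value added only to show the cut-off at work.

```python
def store_voltages(row_percent, full_scale=36.0, cutoff=5.0):
    """Scale one row of digram frequencies so that the largest entry gives
    full_scale volts; entries that would fall below cutoff volts are set to zero."""
    peak = max(row_percent.values())
    volts = {}
    for second, pct in row_percent.items():
        v = full_scale * pct / peak
        volts[second] = v if v >= cutoff else 0.0
    return volts


if __name__ == "__main__":
    # /u:/ row: 30% and 11% are from the text; the 4% entry is hypothetical.
    u_row = store_voltages({"n": 30.0, "m": 11.0, "hypothetical": 4.0})
    # /i:/ row: 23% and 15% are from the text.
    i_row = store_voltages({"t": 23.0, "k": 15.0})
    for name, row in (("u:", u_row), ("i:", i_row)):
        print(name, {k: round(v, 1) for k, v in row.items()})
```

For /u:/ the rule gives 36 volts and about 13 volts, as quoted. For /i:/ to /k/ the rounded percentages give about 23.5 volts rather than the 24 volts quoted; the difference presumably comes from the rounding of the percentages in the text.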
Typical recognitions, as typed by the recogniser, are shown in Fig. 32.


seek        shooter     sheet           meter       meeker      lease
sik         5ute        5it             mite        mike        lis
sik         5uta        5uns            mitea       mtkea       is
sik         5uta        5it             mite        mike        l

Fig. 32. Typical recogniser outputs.

The first line gives the spelling of the word and the second line the
phoneme transcription using the arbitrary symbols of the typewriter.
The third and fourth lines show typical recogniser outputs for the
unbiased and biased modes of operation respectively.


The actual characters printed by the typewriter could be chosen arbitrarily
according to any desired code. The International Phonetic Association (I.P.A.)
symbols were not used because the typewriter had a standard keyboard which did not
include many of the accepted phonetic symbols. The phonemes /t, k, s, m, n, l/ were
represented by their conventional symbols, the vowels /a:, i:, u: and ə:/ were
represented by the letters a, i, u and e, and the phoneme /ʃ/ was typed as the figure 5.
In Fig. 32 the word is shown in normal spelling in the top line and the
transcription using the arbitrary symbols is shown in the second line; the second line
is therefore what the recogniser should type if it were working correctly. The actual
output of the recogniser without the use of the linguistic information is
given in the third line, and the output typed when it is using the stored information
about digram frequencies is given in the fourth and last line. The first word is an
example of the case when the recogniser produces the desired output whether it uses
linguistic information or not. The second word - shooter - is typed wrongly whatever
the mode of operation of the recogniser: the linguistic information is apparently
not strong enough to correct the final vowel wrongly recognised as /a:/ to an /ə:/,
despite the fact that the /t/ to /ə:/ digram frequency is more than three times as great
as that for /t/ to /a:/. The third word is a clear example of how the linguistic
information can help in improving the performance of the recogniser: the /ʃ/ to
/i:/ digram is about three times more probable than the /ʃ/ to /u:/ digram. The
fourth and fifth words are additional examples of the same effect; they also show
that the use of linguistic information in many cases not only produces a correct
recognition but also reduces the frequency with which additional symbols are typed,
symbols that apparently have no counterpart in the speech input. The last example,
the word lease, shows that the use of linguistic information can also have a
detrimental effect: once the wrong recognition is made the digram frequencies
produced for subsequent recognitions from the linguistic memory will also be wrong and
the error tends to be cumulative. In the case of the example, the acoustic input for
the initial consonant /l/ was weak and similar in characteristics to the following
vowel /i:/; when the recogniser was operating without linguistic information it
ignored the initial ambiguous segment and then typed the vowel and final consonant
correctly. When the linguistic information was used it again missed the initial
segment but recognised the vowel as an /l/ because initial vowels do not exist in its
vocabulary. Once it had recognised the consonant, the linguistic information
produced for the next recognition would exclude the possibility of another consonant;
on the other hand the fricative characteristics of the next acoustic segment have no
resemblance to any of the patterns stored for vowels. None of the final
multiplication products will therefore be above the threshold set by the maximum detector and
the only symbol typed for the whole word is the single l.
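
The failure on the word lease can be illustrated with a small sketch of the
selection step. It assumes, as the passage above suggests, that each acoustic
output is multiplied by the digram weight for the previously recognised phoneme
and that a symbol is typed only if the largest product exceeds a threshold. All
numbers, the threshold value and the function name are invented for illustration;
they are not measurements from the machine.

```python
def select_phoneme(acoustic, digram_row=None, threshold=0.2):
    """Pick the phoneme with the largest product, or None if none passes.

    acoustic   -- acoustic match per phoneme (illustrative units)
    digram_row -- weights for phonemes that may follow the previous one;
                  None means unbiased operation (acoustic evidence only)
    threshold  -- hypothetical stand-in for the maximum-detector threshold
    """
    products = {}
    for phoneme, score in acoustic.items():
        weight = 1.0 if digram_row is None else digram_row.get(phoneme, 0.0)
        products[phoneme] = score * weight
    best = max(products, key=products.get)
    return best if products[best] > threshold else None


# Invented numbers for the final fricative of "lease": strong /s/ evidence,
# weak resemblance to the vowels.
acoustic = {"s": 0.9, "i:": 0.1, "a:": 0.05}
# Invented row for "previous phoneme was /l/": only vowels may follow.
after_l = {"i:": 1.0, "a:": 0.6}

unbiased_choice = select_phoneme(acoustic)         # acoustic evidence alone
biased_choice = select_phoneme(acoustic, after_l)  # digram weights applied
```

In the unbiased case the fricative is selected; in the biased case every product
falls below the threshold, so nothing is typed, which mirrors the behaviour
described for the last example of Fig. 32.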


THE  PERFORMANCE  OF  THE  RECOGNISER:  INPUT/OUTPUT  COMPARISONS 

The  performance  of  the  recogniser  was  assessed  by  comparing  the  symbols  typed 
at  the  output  with  the  phoneme  sequences  of  the  input  words;  the  phoneme  recognitions, 
expressed  as  percentages  of  the  total  occurrence  of  each  phoneme  in  the  input,  are 
shown in confusion matrices. Table 8 gives two sets of results, one obtained with
the recogniser operating without the use of linguistic information and the other with
the use of linguistic information. The column headed Om. gives the proportion of
cases in which a particular phoneme at the input was omitted altogether from the
output.  Not  shown  in  these  matrices  and  not  considered  in  the  calculation  of  the 
percentages  given  in  the  table  is  the  number  of  additional,  unwanted  symbols  that 
were  typed;  the  significance  of  these  will  be  discussed  later. 
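
A sketch of how the three kinds of discrepancy could be counted for one word is
given below. The report does not say how typed symbols were aligned with the
input phonemes, so the use of Python's `difflib.SequenceMatcher` for the
alignment is an assumption, as is the function name; the example words are taken
from Fig. 32 in the typewriter's arbitrary symbols.

```python
from difflib import SequenceMatcher


def count_errors(expected, typed):
    """Return (correct, mistakes, omissions, extras) for one word."""
    correct = mistakes = omissions = extras = 0
    matcher = SequenceMatcher(None, expected, typed, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        n_in, n_out = i2 - i1, j2 - j1
        if tag == "equal":
            correct += n_in
        elif tag == "replace":
            # Pair symbols off as mistakes; any surplus on the input side
            # counts as omitted, any surplus on the output side as extra.
            paired = min(n_in, n_out)
            mistakes += paired
            omissions += n_in - paired
            extras += n_out - paired
        elif tag == "delete":
            omissions += n_in
        elif tag == "insert":
            extras += n_out
    return correct, mistakes, omissions, extras


lease_unbiased = count_errors("lis", "is")      # initial consonant omitted
meter_unbiased = count_errors("mite", "mitea")  # one extra symbol typed
shooter_biased = count_errors("5ute", "5uta")   # final vowel mistaken
```

Counting per phoneme rather than per word, and dividing by the number of times
each phoneme occurred at the input, would give the rows of a confusion matrix
with its omission column.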

The  overall  score  for  assessing  the  performance  of  the  recogniser  was  computed 
by  considering  three  different  kinds  of  error:  incorrect  recognitions  (mistakes), 
omissions  and  extra  symbols  typed.  The  error  rate  was  found  to  be  40%  (or  60% 
correct  recognitions)  for  the  unbiased  state  and  28%  (or  72%  correct)  for  the 
biased  state.  The  inclusion  of  the  extra  symbols  typed  in  the  computation  of  the 
error  rates  means  that  it  is  not  really  justifiable  to  deduce  the  score  for  correct 
recognitions  from  the  error  rates  and  they  have  therefore  been  given  in  brackets; 
when  the  extra  symbols  typed  are  not  considered  the  score  for  correct  recognitions 
becomes  72%  and  75%  for  the  unbiased  and  biased  conditions  respectively.  Further 
information  about  the  errors  made  by  the  recogniser  is  given  in  Table  9.  The  table 
gives  separate  figures  for  the  total  number  of  mistakes,  omissions  and  extra  symbols 
typed  in  both  the  unbiased  and  biased  conditions  and  it  also  shows  how  these  errors 
were  distributed  among  the  11  phonemes  in  the  recogniser’s  vocabulary.  The  number  of 
mistakes,  omissions  and  extra  sounds  typed  are  given  as  percentages  of  the  total 
number  of  mistakes,  etc.  respectively;  separate  figures  are  given  again  for  biased 
and  unbiased  operation.  For  example,  in  the  unbiased  state  10%  of  the  total  number 
of  omissions  (=21)  were  omissions  of  the  phoneme  /t/.  Again,  in  the  biased  state 
23%  of  the  wrong  recognitions  (=43)  were  recognised  as  /m/,  etc. 
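
The quoted error rates can be checked approximately against the totals of
Table 9. The sketch below assumes that the denominator is the 270 phoneme
occurrences listed for List 2 in Table 7; the report does not state the
denominator explicitly, so this is an assumption, and the names are illustrative.

```python
TOTAL_PHONEMES = 270  # assumed denominator: List 2 total from Table 7

# Totals from Table 9: (mistakes, omissions, extras)
COUNTS = {"unbiased": (54, 21, 35), "biased": (43, 26, 5)}


def error_rate(mistakes, omissions, extras=0, total=TOTAL_PHONEMES):
    """Errors as a percentage of the phonemes presented at the input."""
    return 100.0 * (mistakes + omissions + extras) / total


rates = {}
for mode, (mistakes, omissions, extras) in COUNTS.items():
    rates[mode] = {
        "error_with_extras": error_rate(mistakes, omissions, extras),
        "correct_without_extras": 100.0 - error_rate(mistakes, omissions),
    }
```

Under this assumption the error rates come to about 41% (unbiased) and 27%
(biased), and the scores without extra symbols to about 72% and 74%. These are
close to, though not identical with, the figures of 40%, 28%, 72% and 75% quoted
in the text.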

The  results  show  that  on  the  acoustic  level  the  major  difficulties  were 
associated  with  the  plosives  and  with  the  nasals  and  laterals.  The  plosives  as  a 
group  could  be  identified  quite  readily,  giving  a  score  of  about  85%,  but  many  of 
the  /k/  inputs  were  recognised  as  /t/,  reducing  the  score  for  /k/  alone  to  31%. 

Some  experimental  results  obtained  by  the  Haskins  Laboratories  in  listening  tests 
using  synthetic  speech  may  perhaps  give  some  explanation  of  these  difficulties.  The 
automatic recogniser uses the spectrum of the "plosive burst" to distinguish between
/t/ and /k/; the Haskins work has shown (39) that the spectrum of this burst for a
particular plosive consonant varies as a function of the adjoining vowel; a high
frequency "burst" clearly indicates a /t/ only when this burst is associated with an
/i:/ vowel and it may be recognised more and more as a /k/ or /p/ as the vowel quality
changes along the circumference of the vowel diagram through /a:/ to /u:/. Further, a
medium frequency (around 1,500 c.p.s.) burst characterises a /k/ much more clearly when


Table 8


Confusion matrices for input/output comparisons, using word list 1.
Results are expressed as percentages of the total occurrences of each phoneme.
Axes: input, output.


(a) Unbiased operation
Table 9


Analysis of mistakes, omissions and extra symbols typed by the machine in
the biased (B) and in the unbiased (UB) mode of operation, using word list 2.


                 Mistakes       Omissions       Extras
                 UB     B       UB     B        UB     B
Total number     54     43      21     26       35     5
                 %      %       %      %        %      %

t 

33 

33 

10 

4 

15 

k 

14 

12 

9 

20 

s 

6 

7 

5 

8 

• 

20 

/ 

6 

6 

m 

11 

23 

5 

4 

3 

n 

7 

5 

14 

12 

24 

20 

1 

2 

16 

19 

4 

3 

40 

i  : 

17 

2 

14 

15 

9 

a  : 

11 

9 

5 

15 

u : 

6 

8 

9 

9 : 

2 

5 

14 

34 

6 

associated with the back vowels like /a:/ and /u:/ than with a front vowel /i:/. The
results obtained for /t/ and /k/ in a later series of experiments in which the words
of List 3 were applied to the automatic recogniser were specially analysed to see how
the correct recognition of these plosives was affected by the adjoining vowels. It
was found that /t/ was always recognised correctly when adjoining the vowel /i:/, but
only 70%, 50% and 25% of the time when pronounced with the vowels /a:/,
/ə:/ and /u:/ respectively; similarly the /k/ recognitions were correct 70%, 60%,
50% and 20% of the time when the plosive was pronounced with /u:/, /a:/, /ə:/ and /i:/
respectively. This suggests therefore that one way of getting better recognition of
these plosives would have been to consider the nature of the adjoining vowel when
assessing the spectrum of the plosive burst.
As far as the nasals and laterals are concerned, even human listeners do not
always find it easy to discriminate between /m/ and /n/ generally and between /m/,
/n/ and /l/ when the discrimination has to be based on two formants only, as in the
recogniser, instead of on three.

A comparison of the results obtained when the recogniser was operating in the
biased and the unbiased condition indicates that, although the use of linguistic
information did not improve the score for correct phoneme recognitions to any great
extent, several significant differences can be observed between the two sets of scores.
For instance the consonant /l/ was recognised correctly more than twice as often when
the recogniser was working in the biased state as in the unbiased condition.
Another difference between the two sets of results is the considerably smaller number
of extra symbols typed in the biased condition, only 5 compared with 35 in the
unbiased condition. Yet another difference is that the number of omissions is greater
in the biased condition. This is largely due to cumulative errors occasionally
produced by the use of linguistic information. A typical way in which this comes about,
as has already been mentioned, is that the machine fails to recognise the
initial consonant of a word: when the acoustic recogniser afterwards, quite correctly,
detects the following vowel the influence of the stored linguistic knowledge does
not allow the vowel symbol to be typed and produces a consonant instead. Consequently
during the next recognition the wrong set of digram frequencies will be utilised and
either the wrong recognition is made again or none at all.

Despite the relatively small improvement in phoneme score and the increase in
the number of omissions, the beneficial effect of using linguistic information is
very noticeable. It is evident, even on first inspection, that the phoneme
sequences typed when linguistic information is used are more like those of English and,
as a result, the words typed give the impression of being possible English words even
if they do not make sense, whilst when no linguistic information is used many of the
words typed have a distinctly non-English appearance, like for instance the last but
one example given in Fig. 32.

The word score, obtained by computing the proportion of complete words recognised
without mistake, omission or extra phoneme, was then calculated for the results
obtained in the biased and unbiased state of the recogniser. The results show that this
word score has indeed increased considerably, from 24% in the unbiased state to 43%
when linguistic information was used.
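
The word score as defined here counts a word only when the typed sequence agrees
exactly with the target transcription. A minimal sketch follows, applied to the
six examples of Fig. 32 only; it does not reproduce the 24% and 43% obtained for
the whole list, and the function name is illustrative.

```python
def word_score(expected_words, typed_words):
    """Percentage of words typed exactly as their target transcription."""
    exact = sum(1 for want, got in zip(expected_words, typed_words) if want == got)
    return 100.0 * exact / len(expected_words)


# The six examples of Fig. 32, in the typewriter's arbitrary symbols.
targets = ["sik", "5ute", "5it", "mite", "mike", "lis"]
unbiased = ["sik", "5uta", "5uns", "mitea", "mtkea", "is"]
biased = ["sik", "5uta", "5it", "mite", "mike", "l"]

unbiased_score = word_score(targets, unbiased)  # 1 of 6 words exact
biased_score = word_score(targets, biased)      # 4 of 6 words exact
```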

Although the comparison of input and output in the way just described is useful
for assessing the performance of the recogniser, it is by no means the only, and
probably not even the most relevant, way of deciding how far it is worthwhile using
linguistic information in the automatic recognition process.

Before discussing the rationale of such other methods and the assessment of the
recogniser’s performance obtained by applying them, input/output comparisons for the
results of some further experiments with the automatic recogniser will be described.
It will be remembered that a third word list, consisting of 200 words, was
also prepared and that this list included two further phonemes, /f/ and /z/, in its
vocabulary. An important reason for increasing the phoneme repertory was to extend
the range of words that the recogniser could tackle: this was needed to make
possible other experiments, to be described later. The additional electronic
circuits were put into use and the recorded word list was applied to the recogniser.
Again, the recogniser was tested in the biased and unbiased condition. As with
List 2, the difference between the overall phoneme scores obtained in the two modes
of operation is not very great: 62% in the unbiased condition and 68% when
linguistic information is used. The number of extra characters typed, which were
not included in the scores just quoted, was however considerably greater for both
biased and unbiased operation. In the biased state the number of extra characters
typed was 52, instead of 5 for List 2, and in the unbiased state 176, instead of 35
previously. This increase is partly due to the different, less careful, articulation
of the speaker and partly due to the shortening of all time constants of the recogniser
circuitry by about 20 to 25%, which may well mean that some of the formant transitions
are recognised as separate phonemes. As a result of the increased number of extra
symbols typed the proportion of correctly recognised words has decreased to 35% in the
biased state of the recogniser, as compared to 43% for List 2. A complete analysis of
the output of the recogniser in its biased mode of operation is shown in Table 10.


As expected, the scores have not changed greatly as compared with List 2, except for
the phonemes /t/, /f/, /s/ and /z/ which all use similar spectral cues. The additional
errors in the recognition of these phonemes are largely within this group. A further
change, as compared with List 2, is that all initial /m/, /n/ and /l/ sounds are grouped
together under the label m, all final /m/ and /n/ sounds are labelled n and the final
/l/ remains as l.

THE  EFFECT  OF  USING  MORE  THAN  ONE  SPEAKER 

It was obviously of interest to know how far the results achieved with one
speaker’s voice are maintained when the same words are spoken by a different speaker.
The words of List 3 were spoken by two additional male speakers and the recordings used
to test the recogniser. The voice of one speaker (F) was used in all experiments
described so far and the circuitry was adjusted to perform best with his voice. The
recordings made by the second and third speakers (T and G) were then applied one after
the other to the recogniser operating in the biased mode. The results, shown in Table
11, indicate that the overall phoneme score has decreased to about 50% to 55% from the
value of about 70% achieved with the voice of the first speaker and that the number
of extra symbols typed has remained substantially unchanged. As a further test the
circuitry of the recogniser was re-arranged to give the best possible results with
the voice of the second speaker (T) rather than that of the first speaker (F). The
re-arrangement consisted of connecting some of the multiplier inputs to different
filters and of changes in the extent to which the filter output voltages were divided
down before being applied to the multipliers. The scores obtained when the words
spoken by (T) were now applied to the recogniser, both for the biased and unbiased
modes of operation, are also shown in Table 11. It will be seen that the results
obtained with this second voice are now very similar, both in terms of correctly
recognised phonemes and of extra symbols typed, to those obtained with the first
speaker’s voice in the previous adjustment of the recogniser; the similarity is
equally marked when the results in the biased mode are compared with each other and
those in the unbiased mode. It seems then that the recogniser performs best when it
is adjusted to the voice of one speaker and that the score drops markedly when another
speaker is used; as far as one can tell from using only three different voices, this
drop in score does not vary greatly from speaker to speaker. The fact that on using
a different voice the recogniser’s performance could be restored by relatively minor
adjustments suggests that it would not be difficult to "teach" the recogniser to
Table 10


Confusion matrix for input/output comparison of results obtained in biased
mode of recogniser, using word list 3. Results are shown as percentages of the
total occurrence of each phoneme.


Total phoneme score: 68
Vowel score: 77
Consonant score: 62
Word score: 35
Table 11


Comparison of results obtained when using the voices of three different
speakers, F, T and G.

           Recogniser
           adjusted to    Mode of       Overall                          No. of
           deal optim-    operation     phoneme    Vowel    Consonant    extra
           ally with      of            score      score    score        symbols
Speaker    voice shown    recogniser    %          %        %            typed
           below          B/UB

F          F              UB            62         54       68           176
F          F              B             68         77       62            52
T          F              B             51         66       41            51
G          F              B             56         61       52            62
T          T              UB            64         61       66           169
T          T              B             65         75       58            60

adjust itself to the voice of different speakers. One could arrange, for example, that
a fresh speaker would first have to say a test sentence; the recogniser, on being
told that it is now dealing with a new voice and that the known test sentence is being
spoken, would go through a pre-arranged routine of multiplier input changes, each time
checking the degree of success. It could then choose that setting which gives the best
performance in recognising the test sentence spoken with a fresh voice.
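
The adjustment routine proposed here amounts to a search over a fixed set of
settings, scored against a known test sentence. The sketch below is one possible
reading, not a description of any circuit that was built: `recognise` is a
hypothetical stand-in for running the machine with a given arrangement of
multiplier inputs, and the scoring by `difflib` matching blocks is an assumption.

```python
from difflib import SequenceMatcher


def phoneme_score(expected, typed):
    """Fraction of expected phonemes that are matched in the typed output."""
    matcher = SequenceMatcher(None, expected, typed, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(expected)


def choose_setting(settings, recognise, test_recording, known_phonemes):
    """Try each pre-arranged setting and keep the best-scoring one.

    recognise(setting, recording) is a hypothetical stand-in for running the
    machine with a given arrangement of multiplier inputs.
    """
    return max(
        settings,
        key=lambda s: phoneme_score(known_phonemes, recognise(s, test_recording)),
    )


# Toy demonstration with an invented recogniser and two invented settings.
def fake_recognise(setting, recording):
    return {"F": "5uta", "T": "5ute"}[setting]


best = choose_setting(["F", "T"], fake_recognise, None, "5ute")
```

In the toy demonstration the setting labelled "T" reproduces the test sequence
exactly and is therefore chosen.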


THE  PERFORMANCE  OF  THE  RECOGNISER:  VISUAL  AND  ACOUSTIC  TESTS 

The comparison of input and output and the compiling of the confusion matrices
based on these comparisons has proved a useful way of evaluating the recogniser, of
finding the causes of errors and remedies for these. As has already been stated,
however, this may not be the only or necessarily the most relevant way of assessing the
performance of a recogniser. Whatever the use to which the output of the recogniser
is put, it will probably be presented in one form or another to a human
reader. This reader has to interpret, that is understand, the output and his own
knowledge of the language is available to correct some of the errors made by the
automatic recogniser. The extent to which the reader can use his own linguistic
knowledge will depend on the kind of mistakes made by the recogniser and also on the
form in which the output is presented to him. The more familiar the form of
presentation the easier the reader will find it to use his linguistic knowledge for this
purpose, and if the presentation is in an unfamiliar form then learning can make
assimilation easier. Two new ways of assessing the performance of a recogniser and of
the difference, if any, made by the use of linguistic information in the automatic
recognition process then suggest themselves: one is to compare the reader’s response
to the output with the words applied to the input and the other is to find the amount
of learning required by the reader in order to reach a given performance in
understanding the recogniser’s output.


As has just been mentioned, the way in which the reader can deal with the output
of the recogniser depends on the form in which this output is presented to him and it
seems worthwhile therefore to consider what are the most likely ways in which the
output of the automatic recogniser will be put to practical use. Apart from its possible
use for the voice control of machinery or of processes - an application where the
question of a human reader does not arise anyway - the most likely applications are in
analysis-synthesis telephony and as a speech typewriter. In the first of these
applications, the phoneme sequence detected by the recogniser is transmitted and used to
control some sort of speech synthesiser: the "reader" in this case will have to
interpret an acoustic transform of a phoneme sequence or in other words he will deal
with audible speech. In the second of the above applications the output is
presented to him in some form of writing which he has to read. The output of the
recogniser was therefore presented in visual and acoustic form to separate groups of
subjects to see how these forms of presentation compare and how far the reader can
correct mistakes made by the recogniser. The words of List 2 were used and altogether
4 experiments were carried out: the output of the recogniser operating in the biased
and in the unbiased mode was presented acoustically and visually to separate groups of
subjects.


For acoustic presentation the phoneme sequence typed by the recogniser was
pronounced by a speaker who was used to reproducing phonetic transcription. This speaker
produced the words on a monotone and in the case of polysyllabic words equal stress was
used for each syllable. The subjects were asked to write down, in normal spelling,
whatever word they recognised.


In the case of visual presentation the output typed by the recogniser could have
been used directly. The symbols typed for the different phonemes were, however,
chosen at the time when the recogniser was constructed and with certain practical
considerations in mind rather than from the point of view of what would be easy to
read. The output was therefore re-typed, using a different set of symbols for the
various phonemes, symbols which were as near as possible to normal English spelling
and therefore, it was thought, could be read without difficulty by the average
English-speaking subject. The symbols used, together with key words, are shown in Table 12
Table 12

Key to the transliteration used in the visual
presentation of the recogniser output.


The symbols used for the 11 phonemes are given side by side with key words
to indicate their value.


t      t in tool or k in cool
k      k in cool
m      m in mother or n in nothing or l in lesson
n      n in nothing
l      l in lesson
s      s in soak
sh     sh in shake or in sugar
ee     ee in fleet or in bean (etc.)
oo     oo in boot
er     er in burn or in after
ah     ah in barn or in 2nd syllable of shorter

and were given in this form to all subjects in the visual experiments prior to the
actual test. On inspecting this key it will be seen that some of the common mistakes
made by the recogniser were also pointed out; for instance the recogniser frequently
printed a t for a /k/ phoneme and therefore tool as well as cool are given as key
words for t. The key gives alternative spellings for individual phonemes. For
instance the key words given for /i:/ are fleet and bean; it was hoped to explain in
this way that the symbols typed, ee in this case, represented phonemes rather than
spelling forms. The subjects were given a sheet on which the output of the recogniser
was printed in the transliteration of Table 12 and they were asked to write the words
they recognised, in normal spelling, alongside the appropriate printed transcription.
Separate sheets were prepared for the outputs obtained when the recogniser was
operating in the biased and unbiased mode and they were presented to different groups
of subjects. As the subjects were asked to write down words, using normal spelling,
a certain amount of care had to be exercised when marking the results so as to allow
for the vagaries of spelling. For instance both colonel and kernel were taken as
correct for the phoneme sequence /kə:nl/, both loot and lute for /lu:t/, etc. On
the other hand cease was a correct response for the phoneme sequence /si:s/ but the
word sees was an incorrect response.
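
The marking rule can be pictured as a lookup from each input phoneme sequence to
the set of spellings accepted for it. The sketch below contains only the examples
given in the text; the table name, the function name and the ASCII-like way of
writing the sequences are illustrative assumptions.

```python
# Accepted spellings per phoneme sequence; the entries are the examples
# quoted in the text, not the full marking key used in the experiments.
ACCEPTED = {
    "kə:nl": {"colonel", "kernel"},
    "lu:t": {"loot", "lute"},
    "si:s": {"cease"},
}


def is_correct(phonemes, response):
    """True if the written response is an accepted spelling of the input."""
    return response.strip().lower() in ACCEPTED.get(phonemes, set())


lute_ok = is_correct("lu:t", "Lute")   # homophone spelling accepted
sees_ok = is_correct("si:s", "sees")   # /si:z/, so not accepted for /si:s/
```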


Table 14


Confusion matrices obtained from the responses to visual (a) and acoustic
(b) presentations of the biased recogniser output. Confusions are expressed
as a percentage of the total occurrences of each phoneme; those amounting to
less than 1 per cent are disregarded. The column headed Om. gives the
percentage of cases in which the phoneme was either omitted or replaced by some
phoneme outside the repertory of the machine.

(a) Visual Response


t  k  s  ʃ  m  n  l  i:  a:  u:  ə:  Om.


(b) Acoustic Response
The results of the visual and of the acoustic tests are summarised in Table 13
and confusion matrices for the results obtained from the visual and acoustic
presentation of the biased output of the recogniser are shown in Table 14. The data
show that the results for presenting the output of the recogniser to human interpreters
are similar to those for input/output comparison; also the overall scores from the
responses to acoustic presentation are not very different from those from the visual
presentation. When subjects’ responses are examined in more detail, however, a number
of significant differences between responses to the two modes of presentation can be
observed, showing that in effect the subjects go through a somewhat different
recognition process in the two cases. In the normal process of speech recognition the
subject is used to interpreting sound patterns with reference to his linguistic memory
and therefore when the recogniser’s output is presented to him acoustically he can
use this memory directly. On the other hand when the recogniser’s output is given to
him visually he first has to go through a process of thinking of the acoustic form of
the printed symbols before he can use his linguistic memory. It seems that the
subjects did not find this an easy process, although they all had previous experience of
reading phonetic transcription. For example the words colonel, sooner and circle,
because of mistakes made by the recogniser, appeared in the visual presentation as
lernl, sooner! and serker. None of the subjects in the visual tests recognised these
words correctly but about 25% of those in the acoustic tests did so.


It seems also that subjects doing the acoustic tests were much more likely to
make phonemic substitutions whilst those presented with the visual form of the output
tended to operate with complete words. One indication for this is that, when in
doubt, the subjects in the acoustic tests experimented freely with phonemic
substitutions in order to produce a word that they thought might be the right answer
whilst those doing the visual tests often did not make a response at all under these
circumstances. It was perhaps as a result of this that the number of omissions of
complete words in the two acoustic presentations amounted to only 18% and 15% of the
total number of words whilst the corresponding figures for the visual tests were 40%
and 27%. This tendency could be observed even for the words that were typed correctly
by the automatic recogniser: only 3% were omitted entirely in the acoustic
presentation as compared with about 7% in the visual test.

Evidence for the greater facility of making phonemic substitutions in the acoustic
form of presentation is the response made by subjects to words correctly typed by the
recogniser. For example, the word tool was typed by the recogniser without mistake so
that in the acoustic test it was heard correctly and the convention for transliteration
was such that the word was even presented with the correct spelling in the visual test.
Nevertheless, only about 60% of the subjects in the acoustic test recognised the word
correctly, most of the others substituting the word cool, whilst in the visual test
90% responded correctly.

The results are, of course, affected by a number of other factors. For example,
the words used are not very homogeneous: words that phonetically or in spelling are
quite close to each other might differ greatly in their frequency of occurrence in
the language and therefore in the extent to which subjects expect them. For instance,
in the acoustic test the word meeker, correctly typed by the recogniser, was
identified correctly only 43% of the time and was taken for meter 28% of the time,
whilst the word meter, also correctly typed by the recogniser, was identified 82% of
the time; in the visual tests both meeker and meter were recognised correctly only
about 55% of the time.

THE INFLUENCE OF CONTEXT ON SUBJECTS’ ABILITY TO INTERPRET THE OUTPUT OF THE
RECOGNISER

The last experiment to be described concerned the effect of the subjects’
expectations on their ability to interpret the output of the recogniser: altering
these expectations is one way of changing the linguistic constraints which affected
the subject. As there was a definite limit to the amount of constraint that could
be included in the machine, it was of great interest to know how far variations of the
constraints affecting the "reader" altered the overall performance.

The  words  of  List  3  were  used  in  the  experiment;  the  presentation  was  always  in 
the  visual  form  and  the  normal  I.P.A.  characters  were  used,  instead  of  the  ones  in 
the  previous  tests,  because  all  18  subjects  were  quite  fluent  in  the  use  of  these 
symbols. Four separate lists of words were used. The first one was the entire List
3; the other three lists consisted of words selected from List 3 whose meanings
had something in common. The words of one list all had something to do with
water, those of the second with humour and pastimes, and those of the third list were
all adjectives. The actual words in these three lists are given in Table 15.


The  output  of  the  recogniser  for  the  words  of  List  3  was  obtained  first.  This 
output  was  then  presented  to  the  subjects  in  separate  tests,  first  for  all  the  words 
of  List  3  and  then  for  the  words  of  the  selections  of  Table  15  in  turn.  They  were 
told  that  the  first  list  was  a  general  one,  whilst  the  meaning  of  the  words  of  the 
other  lists  had  something  in  common  as  shown  by  the  table  headings  and  they  were 
asked  to  write  down  in  ordinary  spelling  what  they  thought  the  words  were.  The  re¬ 
sponses  to  the  last  three  lists  were  then  scored  for  correct  recognition  and  this 
score  was  compared  with  that  obtained  for  the  same  words  in  the  general  list.  The 
results  show  that  the  scores  always  improved  when  the  extra  contextual  clue  was 
available  but  the  improvement  was  only  marginal  for  the  words  connected  with  water, 
25% to 27%, and the words connected with humour and pastimes, 25% to 26%. The
improvement was more noticeable, 19% to 30%, for the list containing adjectives. The
results  must  naturally  be  highly  dependent  on  the  closeness  of  the  common  meaning 
of  the  words  in  any  one  list  and  the  scope  of  the  words  in  the  vocabulary  of  the 
recogniser  (List  3)  was  not  extensive  enough  to  make  up  really  satisfactory  groups 
of  words. 

The  question  of  the  effect  of  the  human  termination  on  the  operation  of  an 
automatic  speech  recogniser  is  obviously  an  important  one  and  much  further  work  is 
needed  to  investigate  it. 


Table 15


Lists of words whose meanings have something in common. The words of all
three lists are included in word list 3.


(a) Words to do     (b) Words to do     (c) Adjectives, including
    with water          with humour         participles, excluding
                        and pastimes        nouns that may be used
                                            attributively

calm                mazurka             neat
keel                art                 lean
ark                 circus              Zulu
canoes              teasers             firmer
sea                 farce               Erse
eel                 facetious           neater
leak                shoot               calm
coot                turn                firm
teal                cartoon             loose
ooze                fool                seen
marsh               turf                far
canoe               laugh               two
teem                teaser              terse
surf                saloon              cooler
shark               tease               cool
seas                                    looser
tarn                                    meek
seal                                    lunar
                                        mean
                                        aloof
                                        meeker
                                        curt
                                        tart
                                        facetious


CHAPTER  VI 


CONCLUSIONS 

On  reviewing  the  work  described  in  this  report  it  can  be  stated  that  an 
automatic  speech  recogniser,  utilising  both  acoustic  and  linguistic  information  in 
its  recognition  processes,  has  been  constructed.  The  circuitry  dealing  with  the 
recognition  of  the  acoustic  characteristics  searches  for  the  presence  of  well-known 
acoustic  correlates  of  the  phonemes  and  provides  information  about  the  probability  of 
occurrence  of  these  phonemes  from  the  acoustic  point  of  view.  The  store  of  linguistic 
knowledge  provides  an  estimate  of  the  linguistic  probability  of  the  occurrence  of  the 
phonemes.  The  recogniser  selects  that  phoneme  for  which  the  combined  probabilities 
are  greatest.  The  recogniser  can  deal  with  altogether  13  phonemes;  9  consonants  and 
4  vowels.  The  principal  aim  of  the  experiments  was  to  investigate  how  far  the  use  of 
linguistic  information  improves  the  performance  of  the  recognition  process.  Some  ad¬ 
ditional  experiments  were  carried  out  to  see  how  far  the  linguistic  knowledge  of  a 
human  observer  can  be  used  to  improve  the  performance  of  a  recogniser  when  he  is  asked 
to  interpret  its  output. 
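
The selection rule summarised above can be illustrated with a minimal sketch. It
assumes that the acoustic evidence and the digram knowledge are each expressed as a
probability per phoneme and that the two are combined by multiplication; the report
states only that the phoneme with the greatest combined probability is selected, so
the product rule, the function name and all the numbers below are illustrative
assumptions and not the values used in the circuitry.

```python
def select_phoneme(acoustic, digram, previous):
    """Return the phoneme whose combined probability is greatest.

    acoustic: dict phoneme -> acoustic probability (hypothetical values)
    digram:   dict (previous, phoneme) -> linguistic probability (hypothetical)
    previous: the phoneme selected last, or "#" for the silence before a word
    """
    combined = {
        p: a * digram.get((previous, p), 0.0)  # assumed product rule
        for p, a in acoustic.items()
    }
    return max(combined, key=combined.get)


# Illustrative numbers only: the acoustic cues slightly favour /k/ ...
acoustic = {"t": 0.45, "k": 0.55}
# ... but after /s/ the assumed digram store favours /t/.
digram = {("s", "t"): 0.7, ("s", "k"): 0.3}

choice = select_phoneme(acoustic, digram, previous="s")
```

With these made-up figures the linguistic information overrides a marginal acoustic
preference; with equal digram probabilities the acoustic preference would decide.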

The results of the experiments show that the use of even a very limited amount of
linguistic information does help in the recognition process: some phoneme sequences
impossible in English were eliminated from the output and the overall word score
improved by about 50% in relative terms, from 28% to 43%. The results also show,
however, that the use of linguistic information can make the results worse as well as
better: once a mistake has been made, the wrong kind of linguistic information is
utilised and a further error is made that might have been avoided had linguistic
information not been used.

The detrimental effect of this procedure was minimised by restricting the speech
material  to  words  spoken  in  isolation  and  the  silence  between  words  was  used  to  check, 
at  frequent  intervals,  that  the  correct  set  of  digram  frequencies  was  being  utilised. 

A more fundamental way of rectifying this kind of error and also of improving the
performance of the recogniser is to organise the linguistic store on several levels, the
phonemic  and  the  word  levels  for  example.  This  can  be  understood  best  by  comparing 
two  automatic  recognition  systems,  one  in  which  only  phonemes  and  phoneme  sequential 
probabilities  to  n  places  are  stored,  and  the  other  which  also  remembers  the  words  of 
the  language  it  deals  with,  the  words  being  stored  as  sequences  of  phonemes  up  to  n 
places.  In  the  system  which  operates  solely  with  phonemes,  successive  recognitions 
are  made  in  the  light  of  preceding  phonemes  only  and  once  made  cannot  be  corrected; 
if  an  error  is  made  it  will  prejudice  all  future  recognitions.  In  the  other  system  a 
whole sequence of phonemes is recognised only provisionally at first and the sequence
is  compared  with  the  word  store  to  find  a  best  match.  The  final  decision  is  then 
reached  in  the  light  of  the  following  as  well  as  of  the  preceding  phonemes.  A  further 
advantage  of  the  second  system  is  that  the  output  is  necessarily  in  the  form  of 
words  whilst  the  purely  phonemic  system  can  produce  phoneme  sequences  that  do  not  form 
meaningful  words.  Once  a  multi-level  system  of  the  kind  just  described  has  been 
established,  its  performance  can  be  improved  further  by  making  the  phoneme  sequential 
probabilities  dependent  upon  preceding  word  recognitions.  Such  feed-back  of  in¬ 
formation  from  level  to  level  can  increase  the  constraints  considerably. 
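
The second system described above, in which a provisional phoneme sequence is
compared with a word store to find a best match, might be sketched as follows. The
report does not say how closeness of match would be measured; the sketch assumes the
Levenshtein edit distance over phoneme symbols, and the word store and its
transcriptions are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def best_match(provisional, word_store):
    """Return the stored word whose phoneme sequence is closest."""
    return min(word_store,
               key=lambda w: edit_distance(provisional, word_store[w]))


# Hypothetical word store (assumed transcriptions).
word_store = {
    "tool": ["t", "u:", "l"],
    "cool": ["k", "u:", "l"],
    "seal": ["s", "i:", "l"],
    "teal": ["t", "i:", "l"],
    "sea": ["s", "i:"],
}

# Provisional output with a wrong final phoneme (/n/ in place of /l/).
word = best_match(["t", "u:", "n"], word_store)
```

Because the whole sequence is compared at once, the decision about each phoneme
depends on what follows it as well as on what precedes it, and the output is
necessarily a word from the store.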

Phonemes and words are, of course, not the only levels of linguistic organisation
that  can  be  included  in  the  linguistic  knowledge  of  a  recogniser.  Further  linguistic 
knowledge  based  on  word  transition  probabilities  and  on  a  sentence  store  would  also 
improve  the  performance.  The  use  of  such  knowledge,  particularly  of  an  adequate 
sentence  store,  would  require  very  considerable  storage  capacity.  A  more  modest, 
though  necessarily  less  all-embracing,  way  of  using  sentence  information  is  to  store 
enough  information  about  sentence  structure  for  enabling  the  machine  to  recognise  the 
syntactical  elements  of  the  input  sentence  and  then  to  modify  constraints  on  word  and 
phoneme  levels  accordingly. 


The  larger  linguistic  units  need  not  be  stored  solely  as  sequences  of  the  smaller 
units.  Acoustic  patterns  corresponding  to  the  larger  units  could  also  form  the  basis 
of  a  recognition  process.  This  would  mean  recognition  in  terms  of  the  longer  acoustic 
sequences  that  are  stored  for  the  larger  linguistic  units  and  it  can  be  expected  that 
this  would  offer  an  advantage  over  the  recognition  of  a  long  sequence  of  shorter  units 
as  used  for  phonemic  recognition,  because  of  the  acoustic  or  articulatory  constraints 
operative  in  speech. 


This  latter  question  is  just  a  small  detail  of  the  much  larger  problem  of 
deciding  whether  it  is  more  rewarding  to  improve  the  sophistication  of  the  acoustic 
recogniser  or  to  extend  the  linguistic  knowledge  of  the  machine.  This  question  can 
only  be  decided  on  empirical  rather  than  on  theoretical  grounds.  As  work  on  auto¬ 
matic  speech  recognition  progresses,  practical  systems  using  these  alternative  prin¬ 
ciples  of  recognition  and  giving  comparable  levels  of  performance  must  be  compared 
to  see  which  one  offers  greater  economy  of  instrumentation. 


So  far  in  the  discussion  it  has  always  been  assumed  that  the  only  way  of  in¬ 
creasing  the  linguistic  constraints  effective  in  the  recognition  process  is  to 
augment  the  linguistic  knowledge  stored  in  the  machine.  In  fact,  the  linguistic 
constraints  can  also  be  increased  by  restricting  the  variety  of  the  speech  material 
used  as  the  input  to  the  recogniser.  Depending  on  the  way  such  restrictions  are 
applied, they can increase the constraints either in the machine or in the human
"reader" of the output. A few experiments on the use of the latter of these
possibilities  have  already  been  described  in  this  report.  As  far  as  the  former  method 
is  concerned,  a  more  restricted  speech  input  will  simplify  the  task  of  the  machine 
not  only  because  of  the  smaller  number  of  choices  offered  but  also  because  by  suitable 
selection  the  variety  of  phonemes  and  of  phoneme  transition  probabilities  can  also  be 
reduced.  An  example  of  this  is  the  restriction  of  the  possible  words  to  those  which 
do  not  contain  consonant  clusters.  Some  indication  of  how  such  factors  operate  has 
been  given  in  the  experiments  described  above:  when  the  repertory  of  phonemes  was 
increased  from  11  to  13  the  phoneme  score  fell  from  75%  to  68%  and  the  word  score 
from  43%  to  35%. 
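
The effect of excluding consonant clusters on the variety of phoneme transitions can
be made concrete with a small count. This is a sketch under stated assumptions: the
five transcribed words are hypothetical and are not taken from the recogniser's word
lists, the vowel set is an assumption, and "#" is used here to mark the silence on
either side of an isolated word.

```python
VOWELS = {"a:", "i:", "u:"}  # assumed vowel symbols for this example


def has_cluster(phonemes):
    """True if two consonants are adjacent anywhere in the word."""
    return any(a not in VOWELS and b not in VOWELS
               for a, b in zip(phonemes, phonemes[1:]))


def inventory(words):
    """Return (distinct phonemes, distinct digrams) for a word list."""
    phonemes, digrams = set(), set()
    for w in words:
        padded = ["#"] + w + ["#"]  # "#" marks silence around the word
        phonemes.update(w)
        digrams.update(zip(padded, padded[1:]))
    return phonemes, digrams


words = [
    ["s", "i:"],            # sea
    ["t", "u:", "l"],       # tool
    ["k", "a:", "m"],       # calm
    ["s", "t", "a:"],       # star (contains the cluster /st/)
    ["m", "a:", "s", "k"],  # mask (contains the cluster /sk/)
]
restricted = [w for w in words if not has_cluster(w)]

all_phonemes, all_digrams = inventory(words)
res_phonemes, res_digrams = inventory(restricted)
```

In this toy list the restriction removes two of the five words and leaves only the
transitions that occur in the remaining three, so fewer digram frequencies have to
be stored and fewer alternatives compete at each decision.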

It seems likely that automatic recognisers designed for a considerably restricted
speech material will acquire practical importance: experimental results suggest that
whilst not enough is known yet to make recognisers dealing with English speech
generally a practical possibility, it is possible that a recogniser designed to identify
a small number of words, say up to 30 or 50, spoken in isolation may operate successfully
in  the  none  too  distant  future.  It  may  well  be  that  many  of  the  methods  used  in  such 
a  specialised  recogniser  offer  a  solution  that  is  relevant  only  to  the  restricted  in¬ 
put  condition;  it  is  equally  likely  though  that  such  machines  will  also  produce  some 
pointers  useful  for  the  solution  of  the  general  problem.  Therefore  the  design  of 
such  restricted  machines  should  be  interesting  from  the  theoretical  as  well  as  from 
the  practical  point  of  view. 

Most  future  experiments  on  automatic  speech  recognition  will  probably  require 
the  storage  of  considerable  amounts  of  information,  the  selective  use  of  such  informa¬ 
tion  and  the  making  of  decisions  dependent  on  a  variety  of  contingencies.  The  large 
digital computers available commercially offer such facilities. These computers are
also suitable for collecting much of the statistical information, about both the
acoustic and linguistic aspects of speech, that is needed for various automatic
speech recognition processes and some of which is detailed in another publication (25).
It  is  likely  therefore  that  computers  will  find  considerable  application  in  this 
branch  of  speech  research.  Although  they  are  expensive  to  rent,  the  alternative 
method  of  constructing  specialised  circuitry  to  perform  these  operations  would  be  even 
more  expensive  and  time  consuming.  It  is  hoped  that  by  using  computers  a  variety  of 
automatic recognition processes can be tried out, evaluated and compared in a
relatively short time. Whenever a method of recognition of practical importance has been
found it should not be too difficult to transform the computer programme into a
practical electronic circuit performing the same function. Work on finding the best ways
of  using  computers  for  research  on  speech  and  automatic  speech  recognition  is  already 
in  progress. 
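
As an indication of the kind of statistical collection a computer could carry out,
the following sketch estimates digram probabilities from phoneme transcriptions of
isolated words. It is an illustration only: the four-word corpus is hypothetical,
"#" is an assumed marker for the silence around a word, and simple relative
frequencies are assumed as the estimate.

```python
from collections import Counter, defaultdict


def digram_probabilities(transcriptions):
    """Estimate P(next | previous) from transcriptions of isolated words."""
    counts = defaultdict(Counter)
    for phonemes in transcriptions:
        padded = ["#"] + list(phonemes) + ["#"]  # "#" marks silence
        for prev, nxt in zip(padded, padded[1:]):
            counts[prev][nxt] += 1
    # Convert the raw counts for each preceding phoneme to relative frequencies.
    return {
        prev: {nxt: n / sum(followers.values())
               for nxt, n in followers.items()}
        for prev, followers in counts.items()
    }


# Hypothetical corpus (assumed transcriptions).
corpus = [
    ["s", "i:"],       # sea
    ["s", "i:", "l"],  # seal
    ["t", "i:", "l"],  # teal
    ["t", "u:", "l"],  # tool
]
probs = digram_probabilities(corpus)
```

A table of this form, built from a much larger body of transcribed speech, is the
sort of stored linguistic knowledge on which a digram-based recogniser relies.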